Lecture 3: Linear Penalization (SUGGESTED SOLUTIONS)
This notebook begins our foray into machine learning. Our goals here are modest. We would like to:
- learn a bit about how machine learning is similar to, and different from, econometrics.
- introduce the scikit-learn package which is chock full of 'machine learning' tools.
- work on some validation methods, which are an important part of the machine learning toolkit.
- explore popular linear penalization models: the ridge, lasso, and elastic-net regression models
We've covered the first two, so we'll focus on the last two in this notebook. These are methods that put discipline on how much each independent variable contributes to the regression.
The agenda for today's lecture is as follows:
- Practical Set Validation
- Variable Selection with OLS
- Shrinkage Methods
- Ridge Regression
- Least Absolute Shrinkage and Selection Operator (LASSO)
- Elastic Net
Class Announcements
None.
1. Reminder: The Bias-Variance Trade-off (top)¶
Our objective in ML is to fit a model
$$ y = f(X) + \epsilon.$$
such that when we get new data ($X$) we can make good predictions ($\hat y$). Doing this requires specifying $f()$. As we discussed before, the fundamental constraint in machine learning is the bias-variance tradeoff; ie, the expected prediction error of our estimated/fit model can be decomposed as
$$\begin{align*} \frac{1}{N}\sum_{n=1}^N(y_n-\hat{y}_n)^2 & = \left(E\left[\hat{f}(X)\right]-f(X) \right)^2 + E\left[\left(\hat{f}(X)-E\left[\hat{f}(X)\right]\right)^2\right] + \text{var}(\epsilon)\\ &= \text{bias}^2 + \text{variance} + \text{var}(\epsilon). \end{align*}$$
Bias and variance trade-off¶
Note that all three terms are positive. We can't do anything about the third term, it is the irreducible error. The other two terms we would like to make as small as possible. Unfortunately, shrinking one of the two terms tends to increase the other term. Roughly speaking, there is no free lunch.
The bias term tells us, on average, how close we get to the true model. Are we systematically over- or under-predicting $y$?
The variance term tells us how much our estimates vary as we use different training datasets to estimate the model.
We would like to minimize the bias and the variance of the left-hand side -- the Mean Squared Error (mse). How can we do so?
In the linear models we are considering, complexity increases as we add more variables to $X$. This includes adding polynomials of our independent variables, interactions, etc. How does complexity influence the testing error?
As we increase the complexity of our model $f(\;)$ the squared bias tends to decrease.
As we increase the complexity of our model $f(\;)$ the variance tends to increase.
This is the tradeoff. As we add features to the model, the bias decreases, but the variance increases. This gives rise to a u-shaped test mse. This figure from James et al. is a good illustration.
Overfitting¶
Behind the bias-variance tradeoff is the idea of overfitting. The more complex the model, the more it will capture variation in $y$ due to the random error ($\epsilon$).
- This makes the model fit the data better (lower bias). We are capturing $y$ behavior from both $f(\;)$ and $\epsilon$.
- This makes the model more variable. A new set of training data will have different $\epsilon$s. The estimate will change to match these new values of $\epsilon$. Since $\epsilon$ is not related to $f(\;)$, we are making the estimate noisier.
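To make the overfitting story concrete, here is a minimal simulation sketch. The data generating process below (a quadratic signal plus noise) is made up for illustration, not the DGP from last lecture; the point is that the training mse falls mechanically as we add polynomial terms, while the test mse eventually turns back up.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def simulate(n):
    # Illustrative DGP: quadratic signal plus irreducible noise
    x = rng.uniform(0, 10, n)
    y = x - 0.05 * x**2 + rng.normal(0, 1, n)
    return x, y

x_train, y_train = simulate(50)
x_test, y_test = simulate(50)

for degree in [1, 2, 5, 10]:
    # Polynomial features of x up to 'degree'
    X_train = np.column_stack([x_train**d for d in range(1, degree + 1)])
    X_test = np.column_stack([x_test**d for d in range(1, degree + 1)])
    fit = LinearRegression().fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, fit.predict(X_train))
    mse_test = mean_squared_error(y_test, fit.predict(X_test))
    print(f'degree {degree:2d}: training mse = {mse_train:5.2f}, test mse = {mse_test:5.2f}')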
We'll handle the yin and yang of this tradeoff by combining the set validation approach (training and testing) we've already seen with model shrinkage.
An Example of Our Proposed Approach¶
Remember the DGP we generated last lecture. We want a process that does more than tell us whether a model is good or not: it should build a model with enough complexity/flexibility while also preventing over-fitting. The idea will be to sample from the data repeatedly and keep models for which the mean squared error is low.
Graphically, this would be like building a model which matches both samples from the DGP we simulated in the previous lecture.
2. Practical "Set Validation" (top)¶
In the last lecture we learned about set validation:
- We break our dataset up into two subsets. Call one the training set and the other the testing set.
- We estimate our model on the training data. We measure how well our model fits the data by looking at the mean squared error of the regression or, alternatively, the r-squared, which normalizes the mse by the total squared error. We call these the training mse or the training $r^2$. This is usually what we care about when we are doing inference.
- Using the X data in the testing data, we use the estimated $f(\;)$ from step 2 to predict values for y.
- We then compare the predicted ($\hat{y}$) values with the actual ($y$) values in the testing data. Again, we can use mse or r-squared to judge how well our predicted data matched the actual data. Call this the test mse or test $r^2$. This is what we usually care about when we are interested in prediction.
You should have felt a bit unsatisfied with this approach. Yes, it enables us to ask "Is my model good?", but "good" is not well-defined, and the validation technique didn't really impose any discipline on the over-fitting problem. In the practice problems of that lecture, however, you ran set validation repeatedly to see whether the variance of the estimates was high. It would be nice if these repeated set validations somehow fed back into the choice of model parameters; then repeated set validation would help prevent over-fitting.
Today, we'll do something similar (repeated set validation) and learn how to practically implement set validation in a ML framework (i.e., the repeated validation will discipline the ML model). We'll call this repeated set validation cross-validation.
2.1 Leave one out cross-validation (loocv)¶
Leave one out cross-validation (loocv) addresses the concerns with set validation. The loocv algorithm is
For each observation in the data i = 1,...,n:
- The test data is observation i
- The training data is all the data except observation i
- Estimate the model on the training data
- Evaluate the model on the testing data; this generates a mean squared error associated with 'round' i
The test mse is the average over the n mses computed: $$ CV_{(n)} = \frac{1}{n}\sum_{i=1}^nmse_{i}$$
Notice that this is similar in spirit to what we were doing last class. The differences are that we repeat the estimation exactly n times and that the split into test and training sets is not random.
Let's try it out. Rather than write code to implement the algorithm above, scikit-learn will do it for us.
First, load up some automobile data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# There are a few missing values coded as '?'.
auto = pd.read_csv('./Data/auto.csv', na_values='?')
auto = auto.dropna()
print(auto.info())
auto.head()
<class 'pandas.core.frame.DataFrame'> Int64Index: 392 entries, 0 to 396 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 mpg 392 non-null float64 1 cylinders 392 non-null int64 2 displacement 392 non-null float64 3 horsepower 392 non-null float64 4 weight 392 non-null int64 5 acceleration 392 non-null float64 6 year 392 non-null int64 7 origin 392 non-null int64 8 name 392 non-null object dtypes: float64(4), int64(4), object(1) memory usage: 30.6+ KB None
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |
We will need the linear_model package to run the regression and the cross_val_score function from scikit-learn.
# From scikit-learn import the cross_val_score function
from sklearn.model_selection import cross_val_score
# From scikit-learn, peel off the linear_model package
from sklearn import linear_model
The cross_val_score() function (docs) takes as arguments an estimator (in our case OLS), the x data, the y data, a scoring criterion, and the number of splits to make to the data, cv.
Two comments:
- The r-squared is not a good metric of fit (score) in this case because we are testing on a single observation, so the r-squared is not even well defined. Instead, we will use the mean squared error. In scikit-learn, the negative of the mean squared error is returned. This is simply a convention: the package creators want all scoring criteria to be increasing in fit (e.g., a larger r-squared is better; a larger, ie closer to zero, negative mse is better).
- We set the cv parameter to the number of observations in the data. We want to split the data into $n$ chunks (each chunk with one observation) and leave one chunk out each iteration.
scores = cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'weight']], auto['mpg'],
scoring='neg_mean_squared_error', cv=len(auto['mpg']))
That was easy! The cross_val_score
function returns an array with the negative mse for each iteration.
print('Scores is of type:', type(scores))
print('Length of scores:', len(scores))
Scores is of type: <class 'numpy.ndarray'> Length of scores: 392
To compute the test mean squared error, $$ CV_{(n)} = \frac{1}{n}\sum_{i=1}^nmse_{i}$$ we just compute the mean of scores. I will multiply by -1, so that we have the mse.
test_mse = scores.mean()
print('The test mse is {0:5.3f}'.format(-test_mse))
The test mse is 18.113
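For intuition, here is a hand-rolled sketch of the same loocv loop that cross_val_score automates. Since loocv involves no randomness, it should reproduce the test mse reported above.

import numpy as np
from sklearn import linear_model

X_loocv = auto[['horsepower', 'weight']].to_numpy()
y_loocv = auto['mpg'].to_numpy()

mses = []
for i in range(len(y_loocv)):
    # Training set: every observation except i; test set: observation i alone
    mask = np.arange(len(y_loocv)) != i
    fit = linear_model.LinearRegression().fit(X_loocv[mask], y_loocv[mask])
    y_hat = fit.predict(X_loocv[~mask])
    mses.append((y_loocv[~mask][0] - y_hat[0])**2)

print('Hand-rolled loocv test mse: {0:5.3f}'.format(np.mean(mses)))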
2.2 K-fold Cross-Validation¶
k-fold cross-validation is a generalization of loocv. An issue with loocv is that we need to estimate the model $n$ times, once for each observation. If there are a lot of data, or if the model is hard to estimate, this approach may be computationally intensive. k-fold cross-validation is the shortcut.
The solution is to break the data up into k sets (or folds), each containing roughly n/k randomly assigned observations. We then proceed as in loocv: each fold is used once as a validation set while the $k-1$ remaining folds form the training set. Note that the data are split consecutively by default in sklearn but can be shuffled.
For example, if $k=10$, split the sample into ten approximately equal groups. Fit the model on nine of the ten groups and use the one left out to generate the mean squared error, $mse_1$. Do this nine more times so you have $\{mse_i\}_{i=1}^{10}$.
In this procedure, the test mse is $$ CV_{(k)} = \frac{1}{k}\sum_{i=1}^k mse_{i}$$
Clearly, loocv is just k-fold cv with $k=n$.
How big should $k$ be? In practice, people tend to use 5 or 10.
[There are additional reasons dealing with bias and variance to use $k<n$, but that discussion is outside the scope of this class.]
# estimator = LinearRegression, # X_data = auto[['horsepower', 'weight']], y_data = auto['mpg']
scores_10 = cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'weight']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10)
print(scores_10)
print('\nThe test mse is {0:5.3f}'.format(-scores_10.mean()))
[-12.08976149 -18.98987088 -25.43307056 -13.78096989 -9.1564195 -6.60875451 -11.73671376 -16.18958302 -57.10638935 -35.20884108] The test mse is 20.630
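As noted above, passing an integer to cv splits the data consecutively. If a shuffled split is preferred, a KFold object can be passed instead; a minimal sketch (the random_state value is arbitrary):

from sklearn.model_selection import KFold, cross_val_score
from sklearn import linear_model

# Shuffle the rows before splitting into 10 folds; fix random_state for reproducibility
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores_shuffled = cross_val_score(linear_model.LinearRegression(),
                                  auto[['horsepower', 'weight']], auto['mpg'],
                                  scoring='neg_mean_squared_error', cv=kf)
print('The test mse with shuffled folds is {0:5.3f}'.format(-scores_shuffled.mean()))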
Practice¶
We will use k-fold cross-validation to compare models with higher-degree polynomials in horsepower.
$$\begin{align} mpg &= \beta_0 + \beta_1 hp +\epsilon\\ mpg &= \beta_0 + \beta_1 hp + \beta_2 hp^2 + \epsilon\\ mpg &= \beta_0 + \beta_1 hp + \beta_2 hp^2 + \beta_3 hp^3 + \epsilon\\ mpg &= \beta_0 + \beta_1 hp + \beta_2 hp^2 + \beta_3 hp^3 + \beta_4 hp^4 + \epsilon\\ mpg &= \beta_0 + \beta_1 hp + \beta_2 hp^2 + \beta_3 hp^3 + \beta_4 hp^4 + \beta_5 hp^5 + \epsilon \end{align}$$
- Create columns named 'hp2' through 'hp5' that correspond to the powers of horsepower. You can do this manually or via a loop.
# Manual
auto['hp2'] = auto['horsepower']**2
auto['hp3'] = auto['horsepower']**3
auto['hp4'] = auto['horsepower']**4
auto['hp5'] = auto['horsepower']**5
# With a loop
for i in range(2,6):
auto['hp' + str(i)] = auto['horsepower']**i
auto.head()
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | hp2 | hp3 | hp4 | hp5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu | 16900.0 | 2197000.0 | 285610000.0 | 3.712930e+10 |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 | 27225.0 | 4492125.0 | 741200625.0 | 1.222981e+11 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite | 22500.0 | 3375000.0 | 506250000.0 | 7.593750e+10 |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst | 22500.0 | 3375000.0 | 506250000.0 | 7.593750e+10 |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino | 19600.0 | 2744000.0 | 384160000.0 | 5.378240e+10 |
- Compute the k-fold validation mse with k=10 $$ CV_{(k)} = \frac{1}{k}\sum_{i=1}^k mse_{i}$$ for each of the models. Save the mse for each model in a list.
test_mse = [] # Initialize empty list to record values
score = cross_val_score(linear_model.LinearRegression(), auto[['horsepower']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10)
test_mse.append(score.mean()) # calculate mean and add to list
test_mse.append(cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'hp2']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10).mean() )
test_mse.append(cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'hp2', 'hp3']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10).mean() )
test_mse.append(cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'hp2', 'hp3', 'hp4']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10).mean() )
test_mse.append(cross_val_score(linear_model.LinearRegression(), auto[['horsepower', 'hp2', 'hp3', 'hp4', 'hp5']], auto['mpg'],
scoring='neg_mean_squared_error', cv=10).mean() )
# The score function reports negative mse, convert to positive
test_mse = [-1*i for i in test_mse]
print(test_mse)
[27.439933652339867, 21.23584005580223, 21.33660618322732, 21.353886981375943, 20.905640766798633]
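The five nearly identical calls above can also be collapsed into a loop; a sketch that should reproduce the same list, since the fold splits here are deterministic:

# Loop over the five polynomial specifications instead of writing five separate calls
poly_cols = ['horsepower', 'hp2', 'hp3', 'hp4', 'hp5']
test_mse_loop = []
for k in range(1, 6):
    score = cross_val_score(linear_model.LinearRegression(),
                            auto[poly_cols[:k]], auto['mpg'],
                            scoring='neg_mean_squared_error', cv=10)
    test_mse_loop.append(-score.mean())
print(test_mse_loop)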
- Plot the mse (not the negative mse) from each model on the y-axis of a scatter plot, with [1, 2, 3, 4, 5] on the x-axis. Which degree of polynomial would you include in your model?
fig, ax = plt.subplots(figsize=(15,6))
ax.plot([1,2,3,4,5], test_mse, color='black')
sns.despine(ax=ax)
# Only have ticks on the integers
ax.set_xticks([1,2,3,4,5])
ax.set_title('MSE for models with different powers of horsepower',size=24)
ax.set_xlabel('degree polynomial of horsepower',size=18)
ax.set_ylabel('mean squared error',size=18)
ax.set_ylim(18,28)
plt.show()
3. Shrinkage Methods (top)¶
The bias-variance tradeoff says that we want to constrain our model's complexity. We don't want the model to memorize the training dataset; we want a model that generalizes well to new, unseen data. There are different ways to do this. For linear models, we do it by penalizing "large" $\beta$ coefficients in order to prevent our model from picking up "peculiarities" or "noise" in the training data.
In econometrics, an uninformative independent variable may still have a large, though insignificant (ie, large standard error), coefficient. We don't rely on standard errors in ML, so we need some other way to discipline such coefficients. This process is known as regularization and amounts to adding a penalty to our minimization routine:
$$ \begin{align*} \hat\beta=\min_{\beta}& \Big[\sum_{i=1}^n\big(y_i-\hat{y}_i\big)^2\Big] + \text{Penalty}(\beta) \end{align*} $$
In intuitive terms, we can think of regularization as a penalty against complexity. Increasing the regularization strength penalizes "large" weight coefficients. Can a coefficient still be "large"? Yes, but only if it does a good job of explaining the data -- that is, of minimizing $\sum_{i=1}^n(y_i-\hat{y}_i)^2$.
Two common and easy-to-grasp methods are ridge and lasso regression. We'll add to those the chameleon known as the elastic net, which nests ridge, lasso, and OLS. These models differ only in the form of "Penalty($\beta$)" they use.
Let's see how they work.
import pandas as pd # data handling
import numpy as np # numerical methods
import matplotlib.pyplot as plt # plots
Load data on baseball players. Each row is a player. The variable we would like to predict is salary.
base = pd.read_csv('./Data/Hitters.csv')
base.head()
Unnamed: 0 | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | ... | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -Andy Allanson | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | ... | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
1 | -Alan Ashby | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | ... | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
2 | -Alvin Davis | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | ... | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
3 | -Andre Dawson | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | ... | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
4 | -Andres Galarraga | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | ... | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
5 rows × 21 columns
The data look okay, but we only have salary for 263 observations. Let's drop the rows with missing values.
base = base.dropna()
base.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 263 entries, 1 to 321 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 263 non-null object 1 AtBat 263 non-null int64 2 Hits 263 non-null int64 3 HmRun 263 non-null int64 4 Runs 263 non-null int64 5 RBI 263 non-null int64 6 Walks 263 non-null int64 7 Years 263 non-null int64 8 CAtBat 263 non-null int64 9 CHits 263 non-null int64 10 CHmRun 263 non-null int64 11 CRuns 263 non-null int64 12 CRBI 263 non-null int64 13 CWalks 263 non-null int64 14 League 263 non-null object 15 Division 263 non-null object 16 PutOuts 263 non-null int64 17 Assists 263 non-null int64 18 Errors 263 non-null int64 19 Salary 263 non-null float64 20 NewLeague 263 non-null object dtypes: float64(1), int64(16), object(4) memory usage: 45.2+ KB
3.1 Ordinary Least Squares (OLS)¶
Let's start with OLS to get a feel for how cross-validation can help us identify the "best" (ie, least risk of over-fitting) model. Start by loading the linear model package:
from sklearn import linear_model # ols, ridge, lasso,
Let's choose some variables that are potentially useful for predicting salary. We are purposely making this set large. The goal is to determine how to constrain our choices. Let's look at some descriptive statistics:
var_list = ['Hits', 'RBI', 'HmRun', 'Walks', 'Errors', 'Years', 'Assists', 'AtBat', 'Runs', 'CAtBat', 'CHits', 'CRuns', 'CWalks']
base[var_list].describe()
Hits | RBI | HmRun | Walks | Errors | Years | Assists | AtBat | Runs | CAtBat | CHits | CRuns | CWalks | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 | 263.000000 |
mean | 107.828897 | 51.486692 | 11.619772 | 41.114068 | 8.593156 | 7.311787 | 118.760456 | 403.642586 | 54.745247 | 2657.543726 | 722.186312 | 361.220532 | 260.266160 |
std | 45.125326 | 25.882714 | 8.757108 | 21.718056 | 6.606574 | 4.793616 | 145.080577 | 147.307209 | 25.539816 | 2286.582929 | 648.199644 | 331.198571 | 264.055868 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 19.000000 | 0.000000 | 19.000000 | 4.000000 | 2.000000 | 1.000000 |
25% | 71.500000 | 30.000000 | 5.000000 | 23.000000 | 3.000000 | 4.000000 | 8.000000 | 282.500000 | 33.500000 | 842.500000 | 212.000000 | 105.500000 | 71.000000 |
50% | 103.000000 | 47.000000 | 9.000000 | 37.000000 | 7.000000 | 6.000000 | 45.000000 | 413.000000 | 52.000000 | 1931.000000 | 516.000000 | 250.000000 | 174.000000 |
75% | 141.500000 | 71.000000 | 18.000000 | 57.000000 | 13.000000 | 10.000000 | 192.000000 | 526.000000 | 73.000000 | 3890.500000 | 1054.000000 | 497.500000 | 328.500000 |
max | 238.000000 | 121.000000 | 40.000000 | 105.000000 | 32.000000 | 24.000000 | 492.000000 | 687.000000 | 130.000000 | 14053.000000 | 4256.000000 | 2165.000000 | 1566.000000 |
We'll eventually be using ridge regression, which works best if the $X$ variables are on the same scale. The StandardScaler() normalizes the variables. Doing this now makes the analysis across models clearer. We'll see this again when we do classification.
from sklearn.preprocessing import StandardScaler # normalize variables to have mu=0, std=1
# Standardize the X vars so they have mean = 0 and std = 1
X = StandardScaler().fit_transform(base[var_list])
df=pd.DataFrame(X)
df.columns = var_list
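As a quick sanity check on the scaling, the transformed columns should have means of roughly zero and standard deviations of roughly one (pandas uses a slightly different denominator than StandardScaler, so the standard deviations will not be exactly 1):

# Verify the scaling: means ~0 and standard deviations ~1
print(df.mean().round(2))
print(df.std().round(2))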
Let's visualize the data after the scaling:
fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(ax=ax,data=df,)
ax.set_ylabel('Value',size=20)
ax.set_xlabel('\nRHS Data Variable',size=20)
ax.set_title('Visualizing the Standard Scaler',size=32)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.yticks(fontsize=16)
ax.set_xticklabels([x for x in var_list],fontsize=16)
plt.show()
OK, they're now on the same scale but there's still a lot of variation. Estimate the OLS regression. Don't worry about the l2 norm stuff yet.
res_ols = linear_model.LinearRegression().fit(X, base['Salary'])
coef_norm_ols = np.linalg.norm(res_ols.coef_, ord=2)
print(res_ols.coef_)
print('The l2 norm of the ols coefficients is {0:5.1f}.'.format(coef_norm_ols))
[ 342.86853172 56.76095882 31.72386749 150.31756567 -10.80513874 -28.76438817 25.9039893 -299.7189859 -95.89253928 -346.33866351 257.07761972 448.22175687 -157.44988341] The l2 norm of the ols coefficients is 810.4.
3.2 Ridge Regression¶
Whereas in OLS we specified the $X$ variables presumably based on some economic theory (b/c OLS makes sense in econometrics), ridge regression is a linear penalization regression that disciplines how much weight each $X$ variable gets in fitting the data. Specifically, we choose $f(\;)$ to minimize the residual sum of squares (RSS) plus a penalty function:
$$ \begin{align*} \min_{\beta}& \Bigg\{\underbrace{\sum_{i=1}^n\big(y_i-\hat{y}_i\big)^2}_{\text{RSS}} + \underbrace{\alpha \times\sum_{j=1}^p \beta_j^2}_{\text{Penalty}}\Bigg\} \end{align*} $$
where $p$ corresponds to the number of variables in $X$. Since OLS minimizes the RSS alone, ridge collapses to OLS when $\alpha=0$. We call $\alpha$ the tuning parameter. When $\alpha>0$, models are penalized for how large their coefficients are. The term multiplying $\alpha$ is the squared l2 (read "ell two") norm of the coefficient vector.
The Ridge() function is part of linear_model in scikit-learn (docs).
First, let's estimate the model with $\alpha=0$. This should return the OLS estimate.
res_ridge_0 = linear_model.Ridge(alpha = 0.0).fit(X, base['Salary'])
coef_norm_r0 = np.linalg.norm(res_ridge_0.coef_, ord=2)
pd.DataFrame({'Variable': var_list, 'OLS': res_ols.coef_ , 'Ridge 0':res_ridge_0.coef_})
Variable | OLS | Ridge 0 | |
---|---|---|---|
0 | Hits | 342.868532 | 342.868532 |
1 | RBI | 56.760959 | 56.760959 |
2 | HmRun | 31.723867 | 31.723867 |
3 | Walks | 150.317566 | 150.317566 |
4 | Errors | -10.805139 | -10.805139 |
5 | Years | -28.764388 | -28.764388 |
6 | Assists | 25.903989 | 25.903989 |
7 | AtBat | -299.718986 | -299.718986 |
8 | Runs | -95.892539 | -95.892539 |
9 | CAtBat | -346.338664 | -346.338664 |
10 | CHits | 257.077620 | 257.077620 |
11 | CRuns | 448.221757 | 448.221757 |
12 | CWalks | -157.449883 | -157.449883 |
print('The l2 norm of the ridge(a=0) coefficients is {0:5.1f}.'.format(coef_norm_r0))
The l2 norm of the ridge(a=0) coefficients is 810.4.
Now estimate the ridge model with $\alpha=100$. This adds a penalty to each coefficient that is not zero.
res_ridge_100 = linear_model.Ridge(alpha = 100.0).fit(X, base['Salary'])
coef_norm_r100 = np.linalg.norm(res_ridge_100.coef_, ord=2)
print(res_ridge_100.coef_)
print('The l2 norm of the ridge(a=100) coefficients is {0:5.1f}.'.format(coef_norm_r100))
[ 52.15138012 38.4847365 8.50483461 50.13721432 -11.04140017 5.05574661 -2.28073256 2.33576236 26.24338138 38.80799151 59.32436779 59.90473347 20.04176666] The l2 norm of the ridge(a=100) coefficients is 129.0.
That decreased the norm of the coefficients quite a bit. We can keep going...
res_ridge_800 = linear_model.Ridge(alpha = 800.0).fit(X, base['Salary'])
coef_norm_r800 = np.linalg.norm(res_ridge_800.coef_, ord=2)
print(res_ridge_800.coef_)
print('The l2 norm of the ridge(a=800) coefficients is {0:5.1f}.'.format(coef_norm_r800))
[24.26292247 22.88938261 14.84319012 25.27605131 -2.84991212 17.94244586 -0.10430381 17.54807161 21.14484914 27.12005935 30.11332925 30.81759733 23.94825642] The l2 norm of the ridge(a=800) coefficients is 78.9.
pd.DataFrame({'Variable': var_list, 'OLS': res_ols.coef_ ,
'Ridge 100':res_ridge_100.coef_,
'Ridge 800':res_ridge_800.coef_ })
Variable | OLS | Ridge 100 | Ridge 800 | |
---|---|---|---|---|
0 | Hits | 342.868532 | 52.151380 | 24.262922 |
1 | RBI | 56.760959 | 38.484737 | 22.889383 |
2 | HmRun | 31.723867 | 8.504835 | 14.843190 |
3 | Walks | 150.317566 | 50.137214 | 25.276051 |
4 | Errors | -10.805139 | -11.041400 | -2.849912 |
5 | Years | -28.764388 | 5.055747 | 17.942446 |
6 | Assists | 25.903989 | -2.280733 | -0.104304 |
7 | AtBat | -299.718986 | 2.335762 | 17.548072 |
8 | Runs | -95.892539 | 26.243381 | 21.144849 |
9 | CAtBat | -346.338664 | 38.807992 | 27.120059 |
10 | CHits | 257.077620 | 59.324368 | 30.113329 |
11 | CRuns | 448.221757 | 59.904733 | 30.817597 |
12 | CWalks | -157.449883 | 20.041767 | 23.948256 |
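To see the shrinkage continuously rather than at a few hand-picked values, here is a sketch that traces every coefficient across a grid of $\alpha$ values (the grid itself is an arbitrary choice):

# Trace each coefficient as alpha increases: the "ridge path"
alphas = np.logspace(-2, 4, 50)
coef_path = np.array([linear_model.Ridge(alpha=a).fit(X, base['Salary']).coef_
                      for a in alphas])

fig, ax = plt.subplots(figsize=(15,6))
for j, name in enumerate(var_list):
    ax.plot(alphas, coef_path[:, j], label=name)
ax.set_xscale('log')
ax.set_xlabel('Tuning Parameter ($\\alpha$), log scale', size=16)
ax.set_ylabel('Coefficient value', size=16)
ax.set_title('Ridge coefficients shrink toward zero as $\\alpha$ grows', size=24)
ax.legend(ncol=4, fontsize=10)
plt.show()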
What size penalty?¶
We see that increasing $\alpha$ decreases the norm of the coefficients. How big should $\alpha$ be? We want the $\alpha$ that gives us the best test mse. As we discussed above, cross-validation provides a methodology to evaluate test MSEs.
The scikit package has a method that combines the ridge estimation with cross validation. It is called RidgeCV() (docs). We pass to RidgeCV() a list of the alpha values to try.
alpha_grid = [1e-3, 1e-2, 1e-1, 1, 1e2, 1e3, 1e4]
# Setting 'store_cv_values' to True will hang on to all the test mses from the CV. Otherwise, it only keeps the best one.
model = linear_model.RidgeCV(alphas=alpha_grid, store_cv_values = True).fit(X,base['Salary'])
The function returns an object with useful attributes and methods. Let's look at a few.
# The .alpha_ holds the best alpha
print('The best alpha from the candidate alphas is {0}.'.format(model.alpha_))
The best alpha from the candidate alphas is 1.0.
best_coef_ridge = pd.DataFrame({'Variable':var_list, 'Ridge': model.coef_})
print(best_coef_ridge)
Variable Ridge 0 Hits 309.382723 1 RBI 61.508414 2 HmRun 21.763078 3 Walks 140.610067 4 Errors -12.050349 5 Years -47.933631 6 Assists 19.984472 7 AtBat -277.165125 8 Runs -70.493630 9 CAtBat -173.607372 10 CHits 177.955937 11 CRuns 355.839427 12 CWalks -139.569671
Since I set 'store_cv_values' to true, I have the matrix of all the test mse. Columns correspond to alpha values, and there is one row for each observation (ie, the function uses loocv).
print(model.cv_values_)
print(model.cv_values_.shape)
[[3.02843609e+04 3.01781068e+04 2.91875911e+04 ... 2.44419200e+00 1.35781601e+03 3.14392936e+03] [4.90961822e+04 4.90869411e+04 4.89995087e+04 ... 1.81722250e+04 9.46378842e+03 4.23064769e+03] [3.58334242e+05 3.58009759e+05 3.54643288e+05 ... 1.34723883e+05 5.49162558e+04 6.19574577e+03] ... [3.57189181e+03 3.58819796e+03 3.73565465e+03 ... 1.18305515e+04 1.32823134e+04 2.04707709e+04] [8.11076387e+04 8.10128280e+04 8.01251664e+04 ... 4.62882943e+04 8.80650163e+04 1.58284001e+05] [2.87812484e+04 2.89069399e+04 3.01449649e+04 ... 4.67532198e+04 9.66306892e+04 1.87097272e+05]] (263, 7)
# The mean test mse for each value of alpha
model.cv_values_.mean(axis=0)
array([124401.99686713, 124327.02083352, 123702.9384668 , 121624.58267113, 125267.96774173, 136848.44181746, 184530.51723419])
And, of course, we can visualize the test mse across $\alpha$ values:
alpha_grid
[0.001, 0.01, 0.1, 1, 100.0, 1000.0, 10000.0]
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(alpha_grid, model.cv_values_.mean(axis=0), marker = 'o', color='blue')
ax.set_xlabel('Tuning Parameter ($\\alpha$)',size=16)
ax.set_ylabel('Average Test MSE (LOOCV)',size=16)
ax.set_title('Determining the Optimal Tuning Parameter ($\\alpha$)',size=24)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Kind of ugly, but that's because our grid spans several orders of magnitude.
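One way to make the picture readable is to put the $\alpha$ axis on a log scale so the small candidate values are not crushed against zero; a sketch:

fig, ax = plt.subplots(figsize=(15,6))
ax.plot(alpha_grid, model.cv_values_.mean(axis=0), marker = 'o', color='blue')
ax.set_xscale('log')   # spread the small alpha values out
ax.set_xlabel('Tuning Parameter ($\\alpha$), log scale',size=16)
ax.set_ylabel('Average Test MSE (LOOCV)',size=16)
ax.set_title('Determining the Optimal Tuning Parameter ($\\alpha$)',size=24)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()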
3.3 LASSO¶
The LASSO (least absolute shrinkage and selection operator) regression works like the ridge, but with a different penalty function.
$$ \begin{align*} \min_{\beta}& \Bigg\{\sum_{i=1}^n\big(y_i-\hat{y}_i\big)^2 + \underbrace{\alpha \times\sum_{j=1}^p | \beta_j|}_{\text{Penalty}}\Bigg\} \end{align*} $$
The penalty function here is the l1 norm, the sum of the absolute values of the coefficients. The major difference between the ridge and the lasso is that the lasso can generate coefficients that are exactly zero, dropping those variables from the model and making it simpler.
The Lasso regression is part of linear_model in scikit-learn (docs).
res_lasso_0 = linear_model.Lasso(alpha = 0.0,max_iter=10000).fit(X, base['Salary'])
coef_norm_l0 = np.linalg.norm(res_lasso_0.coef_, ord=1)
C:\Users\jt83241\AppData\Local\Temp\ipykernel_57432\366927581.py:1: UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator res_lasso_0 = linear_model.Lasso(alpha = 0.0,max_iter=10000).fit(X, base['Salary']) C:\Users\jt83241\Anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: UserWarning: Coordinate descent with no regularization may lead to unexpected results and is discouraged. model = cd_fast.enet_coordinate_descent( C:\Users\jt83241\Anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.375e+07, tolerance: 5.332e+03 Linear regression models with null weight for the l1 regularization term are more efficiently fitted using one of the solvers implemented in sklearn.linear_model.Ridge/RidgeCV instead. model = cd_fast.enet_coordinate_descent(
pd.DataFrame({'Variable': var_list, 'OLS': res_ols.coef_ , 'LASSO 0':res_lasso_0.coef_})
Variable | OLS | LASSO 0 | 
---|---|---|---|
0 | Hits | 342.868532 | 342.868532 |
1 | RBI | 56.760959 | 56.760959 |
2 | HmRun | 31.723867 | 31.723867 |
3 | Walks | 150.317566 | 150.317566 |
4 | Errors | -10.805139 | -10.805139 |
5 | Years | -28.764388 | -28.764388 |
6 | Assists | 25.903989 | 25.903989 |
7 | AtBat | -299.718986 | -299.718986 |
8 | Runs | -95.892539 | -95.892539 |
9 | CAtBat | -346.338664 | -346.338664 |
10 | CHits | 257.077620 | 257.077620 |
11 | CRuns | 448.221757 | 448.221757 |
12 | CWalks | -157.449883 | -157.449883 |
print(f'The l1 norm of the LASSO(a=0) coefficients is {coef_norm_l0:,.1f}.')
The l1 norm of the LASSO(a=0) coefficients is 2,251.8.
res_lasso_2 = linear_model.Lasso(alpha = 2.0).fit(X, base['Salary'])
coef_norm_2 = np.linalg.norm(res_lasso_2.coef_, ord=1)
pd.DataFrame({'Variable':var_list, 'OLS': res_ols.coef_ , 'LASSO 2': res_lasso_2.coef_})
Variable | OLS | LASSO 2 | |
---|---|---|---|
0 | Hits | 342.868532 | 289.542400 |
1 | RBI | 56.760959 | 68.101418 |
2 | HmRun | 31.723867 | 2.138459 |
3 | Walks | 150.317566 | 123.811873 |
4 | Errors | -10.805139 | -3.454889 |
5 | Years | -28.764388 | -47.045848 |
6 | Assists | 25.903989 | 4.797088 |
7 | AtBat | -299.718986 | -260.218269 |
8 | Runs | -95.892539 | -41.705288 |
9 | CAtBat | -346.338664 | -0.000000 |
10 | CHits | 257.077620 | 10.262085 |
11 | CRuns | 448.221757 | 336.280067 |
12 | CWalks | -157.449883 | -123.029009 |
print(f'The l1 norm of the LASSO(a=2) coefficients is {coef_norm_2:,.1f}.')
The l1 norm of the LASSO(a=2) coefficients is 1,310.4.
Practice¶
- Try estimating a lasso regression when $\alpha = 5$. Compare the coefficient vector to the lasso with $\alpha=2$ from above.
res_lasso_5 = linear_model.Lasso(alpha = 5.0).fit(X, base['Salary'])
coef_norm_5 = np.linalg.norm(res_lasso_5.coef_, ord=1)
pd.DataFrame({'Variable':var_list,
'OLS': res_ols.coef_ ,
'LASSO 2': res_lasso_2.coef_,
'LASSO 5': res_lasso_5.coef_})
Variable | OLS | LASSO 2 | LASSO 5 | |
---|---|---|---|---|
0 | Hits | 342.868532 | 289.542400 | 187.192908 |
1 | RBI | 56.760959 | 68.101418 | 53.895913 |
2 | HmRun | 31.723867 | 2.138459 | 0.000000 |
3 | Walks | 150.317566 | 123.811873 | 84.479341 |
4 | Errors | -10.805139 | -3.454889 | -4.387499 |
5 | Years | -28.764388 | -47.045848 | -25.720740 |
6 | Assists | 25.903989 | 4.797088 | -0.000000 |
7 | AtBat | -299.718986 | -260.218269 | -144.987736 |
8 | Runs | -95.892539 | -41.705288 | -0.000000 |
9 | CAtBat | -346.338664 | -0.000000 | -0.000000 |
10 | CHits | 257.077620 | 10.262085 | 66.157744 |
11 | CRuns | 448.221757 | 336.280067 | 180.220618 |
12 | CWalks | -157.449883 | -123.029009 | -32.203122 |
- Estimate lasso regressions with $\alpha$ equal to 10 and 200. What is happening to the coefficients? What is happening to the norm of the coefficients? Does this make sense?
res_lasso_10 = linear_model.Lasso(alpha = 10.0).fit(X, base['Salary'])
coef_norm_10 = np.linalg.norm(res_lasso_10.coef_, ord=1)
res_lasso_200 = linear_model.Lasso(alpha = 200.0).fit(X, base['Salary'])
coef_norm_200 = np.linalg.norm(res_lasso_200.coef_, ord=1)
pd.DataFrame({'Variable':var_list,
'OLS': res_ols.coef_ ,
'LASSO 2': res_lasso_2.coef_,
'LASSO 5': res_lasso_5.coef_,
'LASSO 10': res_lasso_10.coef_,
'LASSO 200': res_lasso_200.coef_,
})
Variable | OLS | LASSO 2 | LASSO 5 | LASSO 10 | LASSO 200 | |
---|---|---|---|---|---|---|
0 | Hits | 342.868532 | 289.542400 | 187.192908 | 75.623355 | 0.000000 |
1 | RBI | 56.760959 | 68.101418 | 53.895913 | 39.965107 | 0.000000 |
2 | HmRun | 31.723867 | 2.138459 | 0.000000 | 0.000000 | 0.000000 |
3 | Walks | 150.317566 | 123.811873 | 84.479341 | 62.899989 | 0.000000 |
4 | Errors | -10.805139 | -3.454889 | -4.387499 | -8.154722 | 0.000000 |
5 | Years | -28.764388 | -47.045848 | -25.720740 | -0.000000 | 0.000000 |
6 | Assists | 25.903989 | 4.797088 | -0.000000 | -0.000000 | 0.000000 |
7 | AtBat | -299.718986 | -260.218269 | -144.987736 | -0.000000 | 0.000000 |
8 | Runs | -95.892539 | -41.705288 | -0.000000 | 0.000000 | 0.000000 |
9 | CAtBat | -346.338664 | -0.000000 | -0.000000 | 0.000000 | 0.000000 |
10 | CHits | 257.077620 | 10.262085 | 66.157744 | 56.927126 | 0.000000 |
11 | CRuns | 448.221757 | 336.280067 | 180.220618 | 135.248380 | 53.351392 |
12 | CWalks | -157.449883 | -123.029009 | -32.203122 | -0.000000 | 0.000000 |
norms = pd.DataFrame([coef_norm_2, coef_norm_5, coef_norm_10, coef_norm_200], index=[2, 5, 10, 200])
norms.rename(columns={0:'Coefficient Norm'},inplace=True)
norms.head()
Coefficient Norm | |
---|---|
2 | 1310.386692 |
5 | 779.245621 |
10 | 378.818680 |
200 | 53.351392 |
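A quick way to summarize the selection effect is to count how many coefficients survive at each $\alpha$; a sketch using the fits from above:

# Count the nonzero coefficients at each alpha: the lasso drops variables as alpha grows
for a, res in zip([2, 5, 10, 200], [res_lasso_2, res_lasso_5, res_lasso_10, res_lasso_200]):
    print(f'alpha = {a:3d}: {np.sum(res.coef_ != 0)} of {len(var_list)} coefficients are nonzero')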
- Use the LassoCV() function (docs) to estimate LASSO regressions for the following grid of alphas: alpha_grid = [1, 1.5, 2, 3, 4, 5, 6, 8, 10]. Set the cv parameter to 10. This means the method will use k-fold cross validation with k=10.
alpha_grid = [1, 1.5, 2, 3, 4, 5, 6, 8, 10]
model = linear_model.LassoCV(alphas=alpha_grid, cv=10).fit(X,base['Salary'])
- Which $\alpha$ worked best?
print(f'The best alpha for this model is {model.alpha_:4.2f}.')
The best alpha for this model is 2.00.
- Use the .mse_path_ attribute of the object returned by LassoCV() to return a (9,10) sized array. The 9 corresponds to the number of alphas and the 10 corresponds to the test mses from the 10-fold cross validation.
MSE = model.mse_path_.mean(axis=1) # Average MSE for each alpha value
ALPHA = model.alphas_ # Set of alphas used in the grid
- Create a plot with $\alpha$ on the x-axis and the average test mse on the y-axis. You will have to average the MSEs using the .mean() method. To make sure the mses correspond to the correct alphas, use the .alphas_ attribute of the results object for the x values.
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(ALPHA, MSE, marker = 'o', color='blue')
ax.set_xlabel('Tuning Parameter ($\\alpha$)',size=16)
ax.set_ylabel('Average Test MSE (10-fold cv)',size=16)
ax.set_title('Determining the Optimal Tuning Parameter ($\\alpha$)',size=24)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Now let the computer sort out the best tuning parameter. Call LassoCV() as before but do not use the alphas argument. The algorithm will search for the best $\alpha$ using cross-validation to prevent over-fitting. Set the k-fold cross-validation to 10 (ie, include the argument cv=10).
model_2 = linear_model.LassoCV(alphas=None,cv=10).fit(X,base['Salary'])
- What is the optimal $\alpha$? How many $\alpha$ values did the algorithm try? [Use the .alphas_ attribute.]
print(f'The best alpha for this model is {model_2.alpha_:4.2f}.')
print(f'The algorithm tried {len(model_2.alphas_):d} different alpha values.')
The best alpha for this model is 2.06. The algorithm tried 100 different alpha values.
- Let's visualize the relationship between $\alpha$ and mse. Using the mse_path_ and alphas_ attributes, create a plot with $\alpha$ on the x-axis and the average test mse on the y-axis.
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(model_2.alphas_, model_2.mse_path_.mean(axis=1), marker = 'o', color='blue')
ax.set_xlabel('Tuning Parameter ($\\alpha$)',size=16)
ax.set_ylabel('Average Test MSE (10-fold cv)',size=16)
ax.set_title('Determining the Optimal Tuning Parameter ($\\alpha$)',size=24)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Display the optimal coefficient vector alongside the OLS and the best ridge coefficients. Did the lasso eliminate any variables? How do the models compare?
best_coef_lasso = pd.DataFrame({'Variable':var_list, 'LASSO': model_2.coef_})
Results = pd.DataFrame({'Variable': var_list, 'OLS': res_ols.coef_})
Results = Results.merge(best_coef_ridge,on='Variable',how='left')
Results = Results.merge(best_coef_lasso,on='Variable',how='left')
Results.head()
Variable | OLS | Ridge | LASSO | |
---|---|---|---|---|
0 | Hits | 342.868532 | 309.382723 | 286.786874 |
1 | RBI | 56.760959 | 61.508414 | 68.325339 |
2 | HmRun | 31.723867 | 21.763078 | 1.474243 |
3 | Walks | 150.317566 | 140.610067 | 122.818539 |
4 | Errors | -10.805139 | -12.050349 | -3.312956 |
3.4 Elastic Net¶
Let's have our cake and eat it too. Enter the "Elastic Net". Two models enter (Ridge and LASSO) but only one model leaves. Not quite Bloodsport, but close.
The Elastic Net combines characteristics of both Ridge and Lasso; ie, it reduces the impact of different features while not eliminating all of the features. The Elastic Net is defined as:
$$ \begin{align*} \min_{\beta}& \Bigg\{\underbrace{\sum_{i=1}^n\big(y_i-\hat{y}_i\big)^2}_{\text{RSS}} + \lambda\times\bigg[\frac{(1-\alpha)}{2} \times\sum_{j=1}^p \beta_j^2 + \alpha \times\sum_{j=1}^p | \beta_j|\bigg]\Bigg\} \end{align*} $$
The parameter $\alpha$ is the mixing parameter between Ridge ($\alpha=0$) and Lasso ($\alpha=1$). The $\lambda$ parameter is the penalty parameter; $\lambda=0$ generates OLS.
Confusingly:
- We set the alpha hyperparameter via the l1_ratio argument, which controls the mix of the L1 and L2 penalties.
- We set the lambda hyperparameter via the alpha argument, which controls the contribution of the combined penalty to the loss function.
Note that the defaults are an equal balance for the alpha hyperparameter (l1_ratio=0.5) and a weight of 1.0 for the penalty parameter $\lambda$ (alpha=1).
res_elastic = linear_model.ElasticNet(l1_ratio=0.5,alpha=1.0).fit(X, base['Salary'])
pd.DataFrame({'Variable':var_list, 'Elastic Net': res_elastic.coef_})
Variable | Elastic Net | |
---|---|---|
0 | Hits | 46.741287 |
1 | RBI | 35.993662 |
2 | HmRun | 10.472049 |
3 | Walks | 46.321603 |
4 | Errors | -9.678005 |
5 | Years | 9.412563 |
6 | Assists | -1.885578 |
7 | AtBat | 8.004799 |
8 | Runs | 26.898680 |
9 | CAtBat | 38.156859 |
10 | CHits | 54.245400 |
11 | CRuns | 54.932313 |
12 | CWalks | 23.347402 |
Practice¶
- Use ElasticNetCV to estimate an elastic net model with 10-fold cross-validation. Compare the results to the Ridge and LASSO results. Set the $\alpha$ values using the argument l1_ratio=l1_grid, where l1_grid is the grid of potential $\alpha$ values created in the next cell: l1_grid = np.arange(.1,1,1e-2).tolist()
l1_grid = np.arange(.1,1,1e-2).tolist()
# Fit the model. I set the number of lambda candidates for each alpha (l1_ratio) value to 1000 (the default is 100)
res_elastic = linear_model.ElasticNetCV(l1_ratio=l1_grid,n_alphas=1000,cv=10).fit(X, base['Salary'])
# Return the (lambda,alpha) pair that minimizes the test mse
print('The best lambda for this model is {0:6.4f}.'.format(res_elastic.alpha_))
print('The best alpha for this model is {0:6.4f}.'.format(res_elastic.l1_ratio_))
The best lambda for this model is 0.2899. The best alpha for this model is 0.9300.
best_coef_elastic = pd.DataFrame({'Variable':var_list, 'Elastic Net': res_elastic.coef_})
Results = pd.DataFrame({'Variable': var_list, 'OLS': res_ols.coef_})
Results = Results.merge(best_coef_ridge,on='Variable',how='left')
Results = Results.merge(best_coef_lasso,on='Variable',how='left')
Results = Results.merge(best_coef_elastic,on='Variable',how='left')
Results
Variable | OLS | Ridge | LASSO | Elastic Net | |
---|---|---|---|---|---|
0 | Hits | 342.868532 | 309.382723 | 286.786874 | 201.956558 |
1 | RBI | 56.760959 | 61.508414 | 68.325339 | 66.608628 |
2 | HmRun | 31.723867 | 21.763078 | 1.474243 | 3.222321 |
3 | Walks | 150.317566 | 140.610067 | 122.818539 | 109.878949 |
4 | Errors | -10.805139 | -12.050349 | -3.312956 | -13.959146 |
5 | Years | -28.764388 | -47.933631 | -46.653438 | -55.848516 |
6 | Assists | 25.903989 | 19.984472 | 4.403061 | 8.721780 |
7 | AtBat | -299.718986 | -277.165125 | -257.913655 | -180.999441 |
8 | Runs | -95.892539 | -70.493630 | -39.978953 | -17.765845 |
9 | CAtBat | -346.338664 | -173.607372 | -0.000000 | -16.116714 |
10 | CHits | 257.077620 | 177.955937 | 11.966717 | 137.178701 |
11 | CRuns | 448.221757 | 355.839427 | 332.473667 | 193.934717 |
12 | CWalks | -157.449883 | -139.569671 | -121.022429 | -80.818871 |
coeff_var = pd.DataFrame([res_ols.coef_.var(),
best_coef_ridge['Ridge'].var(),
best_coef_lasso['LASSO'].var(),
best_coef_elastic['Elastic Net'].var()], index=['OLS', 'Ridge', 'LASSO', 'Elastic Net'])
coeff_var.rename(columns={0:'Variance of Coefficients'},inplace=True)
coeff_var.head()
Variance of Coefficients | |
---|---|
OLS | 49692.522353 |
Ridge | 33497.511737 |
LASSO | 23976.706027 |
Elastic Net | 12270.408073 |
Comments: The Elastic Net coefficients look closer to the LASSO's, but the Elastic Net keeps all of the variables.
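As a final check, one could put the models on a common footing by computing the 10-fold test mse for each, plugging in the tuned penalties found above; a sketch (the Ridge alpha of 1.0 is the RidgeCV winner from earlier):

from sklearn.model_selection import cross_val_score

# 10-fold test mse for each model, using the tuned penalties found above
candidates = {
    'OLS': linear_model.LinearRegression(),
    'Ridge': linear_model.Ridge(alpha=1.0),                    # best alpha from RidgeCV
    'LASSO': linear_model.Lasso(alpha=model_2.alpha_),         # best alpha from LassoCV
    'Elastic Net': linear_model.ElasticNet(alpha=res_elastic.alpha_,
                                           l1_ratio=res_elastic.l1_ratio_),
}
for name, est in candidates.items():
    mse = -cross_val_score(est, X, base['Salary'],
                           scoring='neg_mean_squared_error', cv=10).mean()
    print(f'{name:12s} 10-fold test mse: {mse:10.1f}')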