Lecture 8: Causality via Double, Debiased ML [FINISHED]
In causal inference, we often estimate causal effects by conditioning the analysis on other variables. We usually refer to these variables as control or instrumental variables. In randomized controlled trials (aka A/B tests), conditioning increases the power of the analysis by reducing imbalances that have emerged across groups despite randomization. However, conditioning is even more important in observational studies, where -- absent randomization -- identifying causality can be very difficult.
Complicating things is that finding good instruments is difficult. When we have many candidate controls or instruments, we might want to select the most relevant ones while allowing for (but not imposing) nonlinearities and interactions. We already know that ML algorithms are perfect for this task. This lecture is all about how to incorporate these algorithms to identify causal effects. Whereas before we only cared about prediction, now we're interested in using prediction to estimate causal effects, so at the end of the day we also care about how ML algorithms affect bias and standard errors / confidence intervals.
In this lecture, you will see that ML algorithms introduce a bias called regularization or pre-testing bias. Interestingly, such a bias is actually quite common in empirical work, but the nature of ML models makes the bias acute, which is to say the researcher can't pretend it doesn't exist. Today, we'll cover the source of the bias and discuss two powerful solutions:
Post-double selection
Double debiased machine learning.
The former will connect directly to pre-test bias while the latter is a generalization to nonlinear controls / functions. My objective in this lecture is that you understand post-double selection -- even if at just a high level -- and can implement it. I'm introducing Double, Debiased ML so you're aware of it and know why it's important.
Specifically, we'll discuss the following in this lecture:
1. Pre-Testing Bias
2. Pre-Testing and Machine Learning
3. Double Debiased Machine Learning
**Warning:** This lecture runs the risk of blowing you away (and not in a good way) with technical details, so I've tried to communicate intuition wherever possible. The discussion will get mathy, and I'm not expecting you to fully grasp the mathematical complexities of double ML. Understanding these ideas -- even at a high level -- will be important for you in your careers. Why? Because causality is fundamental for organizations that want to leverage data to make impactful, science-based decisions, and these are likely the most important tools at the intersection of machine learning and causal inference of the last decade. For example, tech firms such as Amazon, Microsoft, and Uber now routinely use these tools to make data-driven decisions.
Class Announcements
PS3 (Causality) will be on eLC later today and is due by end of day Sunday. This PS will be much shorter than PS2. Usual office hours this week.
1. Pre-Testing Bias (top)
We begin by generating some data on sales and advertising. What we'll be doing is called Monte Carlo simulation and we've seen this before when we discussed the bias-variance tradeoff. It's useful because by simulating the data we can see how effective our tools are at recovering the true underlying parameters.
import numpy as np
import pandas as pd
def generate_data(a=1, b=.3, c=3, N=1000, seed=1):
    np.random.seed(seed)

    # Past sales
    past_sales = np.random.normal(5, 1, N)

    # Advertising spend (driven by past sales)
    ads = c*past_sales + np.random.normal(-3, 1, N)

    # Current sales (driven by ads and past sales)
    sales = a*ads + b*past_sales + np.random.normal(0, 1, N)

    # Assemble the dataframe
    df = pd.DataFrame({'ads': ads, 'sales': sales, 'past_sales': past_sales})

    return df
df = generate_data()
df.head()
| | ads | sales | past_sales |
|---|---|---|---|
| 0 | 16.719800 | 19.196620 | 6.624345 |
| 1 | 7.732222 | 9.287491 | 4.388244 |
| 2 | 10.923469 | 11.816906 | 4.471828 |
| 3 | 8.457062 | 9.024376 | 3.927031 |
| 4 | 13.085146 | 12.814823 | 5.865408 |
Our simulated data includes 1000 different markets; in each we observe current sales, the amount spent on advertising, and past sales.
We want to understand whether ads spending is effective in increasing sales. We could run one of the two following competing models:
$$
\begin{align}
\text{"Complex" }\;\; \text{sales}_t &=\alpha \times \text{ads}_t + \beta\times \text{sales}_{t-1} + \epsilon_t\\
\text{"Simple" }\;\; \text{sales}_t &=\beta\times \text{ads}_{t} + \epsilon_t
\end{align}
$$
A useful feature of these regressions is that they're nested. Since we are not sure whether to condition the analysis on past sales, we could let the data decide: estimate the "complex" regression and, if the effect of past sales, $\beta$, is statistically significant, keep that model; otherwise, go with the "simple" model. BTW I hate myself for anthropomorphising data. Data can't talk, make decisions, skip, or jump.
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
smf.ols('sales ~ ads + past_sales', df).fit().summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.1405 | 0.185 | 0.758 | 0.448 | -0.223 | 0.504 |
| ads | 0.9708 | 0.030 | 32.545 | 0.000 | 0.912 | 1.029 |
| past_sales | 0.3381 | 0.095 | 3.543 | 0.000 | 0.151 | 0.525 |
It seems that the effect of past sales on current sales is positive and significant. Therefore, we are happy with our specification and we conclude that the effect of ads on sales is positive and significant with a 95% confidence interval of [0.912, 1.029].
Visualizing pre-test bias
A subtle but important point is that we are not taking into account the fact that we ran a test to decide whether to include `past_sales` in the regression. Doing so affects inference on the effect of `ads` on `sales`, which is to say it affects the standard errors of $\alpha$ and therefore potentially our confidence in identifying a causal effect.
Don't worry if these words don't immediately make sense to you. Part of the usefulness of Monte Carlo simulation is that we can simulate our decision-making process by evaluating what would happen if we repeated this procedure many times. Here is the procedure:
1. We draw a new sample from the data generating process.
2. We regress sales on ads and past_sales.
3. If the coefficient of past_sales is significant at the 95% level, we keep $\hat \alpha$ from (2).
4. Otherwise, we regress sales on ads only, and we keep that coefficient.
Note: The difference between (3) and (4) is where all of the magic happens: this is where a continuous object, $\hat\beta$, creates a discrete choice in the framework because of our (fairly arbitrary) decision about the "critical value." For example, consider two $\hat{\beta}$ values on either side of the critical value. For one we choose the "complex" model while for the other we choose the "simple" model. Our choice will have a big effect on $\hat\alpha$ even though the $\hat\beta$ values aren't all that different.
In the cell below I've written some code to do this. Let's run it and see what happens.
def pre_testing(d='ads', y='sales', x='past_sales', K=1000, **kwargs):

    # Initialize the df to collect results across simulations
    alpha = pd.DataFrame({'Complex': np.zeros(K),
                          'Simple': np.zeros(K),
                          'Pre-test': np.zeros(K)})

    # Loop over simulations
    for k in range(K):

        # Generate data
        df = generate_data(seed=k, **kwargs)

        # Compute coefficients
        alpha.loc[k, 'Complex'] = smf.ols(f'{y} ~ {d} + {x}', df).fit().params[d]
        alpha.loc[k, 'Simple'] = smf.ols(f'{y} ~ {d}', df).fit().params[d]

        # Compute significance of beta
        p_value = smf.ols(f'{y} ~ {d} + {x}', df).fit().pvalues[x]

        # Select specification based on p-value
        if p_value < 0.05:
            alpha.loc[k, 'Pre-test'] = alpha.loc[k, 'Complex']
        else:
            alpha.loc[k, 'Pre-test'] = alpha.loc[k, 'Simple']

    return alpha
alphas = pre_testing()
The code above loops over different simulated samples, all generated by the same DGP, and for each sample records the estimated coefficients as a row in a dataframe. Let's take a look at the output.
alphas
| | Complex | Simple | Pre-test |
|---|---|---|---|
| 0 | 0.998192 | 1.079866 | 0.998192 |
| 1 | 0.970778 | 1.070600 | 0.970778 |
| 2 | 1.038937 | 1.089262 | 1.089262 |
| 3 | 1.009559 | 1.091312 | 1.009559 |
| 4 | 0.994370 | 1.083864 | 0.994370 |
| ... | ... | ... | ... |
| 995 | 0.967369 | 1.098052 | 0.967369 |
| 996 | 0.986134 | 1.083485 | 0.986134 |
| 997 | 0.987781 | 1.087176 | 0.987781 |
| 998 | 1.013875 | 1.099418 | 1.013875 |
| 999 | 0.979449 | 1.095231 | 0.979449 |

1000 rows × 3 columns
And we can visualize the results:
import matplotlib.pyplot as plt
import seaborn as sns
def plot_alphas(alphas, true_alpha):

    # Init plot
    fig, ax = plt.subplots(1, len(alphas.columns), figsize=(5*len(alphas.columns), 5), sharey=True, sharex=True)

    # Make one plot for each set of coefficients
    for i, key in enumerate(alphas.keys()):
        ax[i].hist(alphas[key], bins=30, lw=.1)
        ax[i].set_title(key, size=18)
        ax[i].axvline(true_alpha, c='r', ls='--')
        legend_text = [r'$\alpha=%.0f$' % true_alpha, r'$\hat \alpha=%.4f$' % np.mean(alphas[key])]
        ax[i].legend(legend_text, prop={'size': 12}, loc='upper right', frameon=False)
        sns.despine(ax=ax[i])
plot_alphas(alphas, true_alpha=1)
As we can see from the first plot, if we were always running the "complex" regression, our estimator $\hat \alpha$ would be unbiased and normally distributed. However, if we were always running the "simple" regression (second plot), our estimator $\hat \alpha$ would be biased (wrong). This is fundamentally an issue of omitted variable bias.
The pre-testing procedure generates an estimator $\hat \alpha$ that is a mix of the two: most of the time we select the correct specification, but sometimes the pre-test fails to reject the null hypothesis of no effect of past sales on sales and we select the incorrect specification, running the "simple" regression.

Importantly, the pre-testing procedure does not generate a (noticeably) biased estimator: as we can see in the last plot, the average estimated coefficient is very close to the true value, 1. The reason is that the number of times we select the "simple" regression is small enough not to introduce much bias, but not small enough to leave inference valid.

Indeed, pre-testing distorts inference: the distribution of the estimator $\hat \alpha$ is no longer normal -- it's bimodal because of the omitted variable bias in the "simple" branch. Statistically, this means our confidence intervals for $\alpha$ will have the wrong coverage: they will contain the true effect with a different probability than the claimed one.
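Here is a minimal sketch of what "wrong coverage" means in practice. The helper `coverage_simulation` is my own (it is not defined elsewhere in this lecture); it reuses `generate_data` and `smf` from above and simply counts how often the 95% confidence interval for $\alpha$ contains the true value under each rule.

def coverage_simulation(K=1000, true_alpha=1, **kwargs):
    # Empirical coverage of the 95% CI for alpha under two rules:
    # always run the "complex" regression vs. pre-test on past_sales
    covered = {'Complex': 0, 'Pre-test': 0}
    for k in range(K):
        df = generate_data(seed=k, **kwargs)
        fit_complex = smf.ols('sales ~ ads + past_sales', df).fit()
        fit_simple = smf.ols('sales ~ ads', df).fit()

        # Always-complex: 95% CI for the ads coefficient
        lo, hi = fit_complex.conf_int().loc['ads']
        covered['Complex'] += (lo <= true_alpha <= hi)

        # Pre-test: use the complex CI only if past_sales is significant
        if fit_complex.pvalues['past_sales'] < 0.05:
            lo, hi = fit_complex.conf_int().loc['ads']
        else:
            lo, hi = fit_simple.conf_int().loc['ads']
        covered['Pre-test'] += (lo <= true_alpha <= hi)

    return {key: val / K for key, val in covered.items()}

# With a beta that is hard to detect, the pre-test interval should cover the true
# value far less often than the nominal 95%, while the always-complex interval
# should stay close to 95%
coverage_simulation(b=0.1)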
The Pre-testing Problem as Omitted Variable Bias
The problem of pre-testing arises because of the bias generated by running the simple regression: omitted variable bias (OVB). Here, it shows up because we don't have enough statistical power to avoid Type-2 errors. The answer is therefore simple: a bigger sample size. If we have more observations, we can more precisely estimate $\beta$ and the likelihood of Type-2 errors decreases.
To see this, let's simulate the estimated $\hat\alpha$ under different sample sizes. Remember that the sample size used until now is N=1000.
Ngrid = [100,300,1000,3000]
alphas = {f'N = {n:.0f}': pre_testing(N=n)['Pre-test'] for n in Ngrid}
alphas = pd.DataFrame(alphas)
plot_alphas(alphas, true_alpha=1)
As we can see from the plots, as the sample size increases (left to right), the bias decreases, and the distribution of $\hat \alpha$ converges to a normal distribution with mean at the true value.
What happens instead if the value of $\beta$ were different? Put differently, how does the importance of the omitted variable affect the pre-testing bias?
betas = 0.3 * np.array([0.1,0.3,1,3])
alphas = {f'beta = {b:.2f}': pre_testing(b=b)['Pre-test'] for b in betas}
alphas = pd.DataFrame(alphas)
plot_alphas(alphas, true_alpha=1)
As the value of $\beta$ increases, the bias first appears and then disappears. When $\beta$ is small (left-most plot), we often choose the "simple" regression, but the resulting omitted variable bias is small: the average estimate is very close to the true value, so the cost of omitting past sales is negligible.
As the value of $\beta$ increases (ie, as we move from left to right), the importance of past sales increases. When it becomes big (right plot), we always run the "complex" regression and the bias disappears. Hence, pre-testing bias (omitted variable bias) is an issue whenever the confounder, as measured by $\beta$, is of intermediate importance. Oh good. What does that even mean practically?!
Let's repeat the same simulation as above but now increase the coefficient and the sample size simultaneously.
betas = 0.3 * 30 / np.sqrt(Ngrid)
alphas = {f'N = {n:.0f}': pre_testing(b=b, N=n)['Pre-test'] for n,b in zip(Ngrid,betas)}
alphas = pd.DataFrame(alphas)
plot_alphas(alphas, true_alpha=1)
We see that the distortion exists no matter the sample size. This exercise captures the idea of "magnitude" (ie, $\beta$ "big" or "small") in a world where we do inference relying on asymptotic results such as the Central Limit Theorem. What do we learn here? Inference will always be wrong when there is pre-testing bias.
A subtle but important point to shake your belief in reality: every paper you've ever read likely suffers from pre-testing bias, since the researchers likely ran many regressions or model-fitting exercises before settling on the "best" one.
2. Pre-Testing and Machine Learning (top)
So far we talked about a linear regression with only 2 variables. Where is the machine learning?
Usually, we have many candidate control variables and are unsure which ones belong in the model. Moreover, we might want to be flexible with respect to the functional form through which these control variables enter the model. That is, we'd like to estimate the partially linear model

$$
\begin{align}
Y &= \alpha D + g(X) + u\\
D &= m(X) + v
\end{align}
$$

where the effect of interest is still $\alpha$, X is potentially high-dimensional, and we do not take a stand on the functional forms $g()$ and $m()$ through which $X$ influences $D$ or $Y$. In our example, Y corresponds to sales, D corresponds to ads, X corresponds to past_sales, and the effect of interest is $\alpha$. In our example, X is 1-dimensional for simplicity, but generally we are interested in cases where X is high-dimensional, potentially even having more dimensions than the number of observations. In that case, variable selection is essential in linear regression since we cannot have more features than observations (the OLS coefficients are not uniquely determined).
LASSO as a motivating example
It would be natural to use an ML algorithm to estimate $g()$ and $m()$, but ML algorithms introduce a regularization bias comparable to pre-testing. To see this, consider LASSO, which is linear in X but adds a penalization term that effectively performs variable selection, much like the pre-test in our example. Therefore, if we simply ran a LASSO of Y on D and X, we would be introducing regularization bias and inference would be distorted. The same goes for more complex algorithms.
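To see the regularization bias in action, here is a minimal sketch on our simulated data. This block is my own illustration (it is not part of the lecture's code): it assumes scikit-learn is installed, and the penalty value 0.1 is chosen only to make the shrinkage visible.

from sklearn.linear_model import Lasso

df = generate_data()
X = df[['ads', 'past_sales']]
y = df['sales']

# LASSO of sales on ads and past_sales: the L1 penalty shrinks the coefficients and,
# if the penalty is strong enough, drops past_sales entirely. The ads coefficient then
# absorbs the omitted variable bias (compare with the OLS estimate of roughly 0.97).
# Note: sklearn calls the penalty "alpha"; it is unrelated to our treatment effect alpha.
lasso = Lasso(alpha=0.1).fit(X, y)
print(dict(zip(X.columns, lasso.coef_.round(3))))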
The post-double selection algorithm
A recipe for addressing regularization / pre-testing bias is post-double selection, which was introduced in:
A. Belloni, D. Chen, V. Chernozhukov, C. Hansen, Sparse Models and Methods for Optimal Instruments With an Application to Eminent Domain (2012), Econometrica.
A. Belloni, V. Chernozhukov, C. Hansen, Inference on treatment effects after selection among high-dimensional controls (2014), The Review of Economic Studies.
The idea behind post-double selection is to bound the omitted variable bias. Post-double selection consists of the following procedure.
Step 1: Reduced Form selection
Do LASSO of Y on X. Keep the controls that LASSO selects (ie, those with non-zero coefficients).
Step 2: First Stage selection
Do LASSO of D on X. Keep the controls that LASSO selects.
Step 3: Estimation
Regress Y on D and the union of the controls selected in the first two steps.
The name post-double selection comes from the fact that we are performing variable selection not once but twice.
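To make the recipe concrete, here is a minimal sketch with many candidate controls. Everything in this block is my own illustration: the toy DGP `generate_hd_data`, the penalty level, and the use of scikit-learn's `Lasso` are assumptions, not part of the lecture's code.

from sklearn.linear_model import Lasso
import statsmodels.api as sm

def generate_hd_data(a=1, N=1000, p=50, seed=1):
    # Toy DGP: 50 candidate controls, only the first two actually matter
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N, p))
    D = 2*X[:, 0] + 0.5*X[:, 1] + rng.normal(size=N)
    Y = a*D + 3*X[:, 0] + X[:, 1] + rng.normal(size=N)
    return Y, D, X

Y, D, X = generate_hd_data()

# Step 1 (reduced form): LASSO of Y on X, keep the selected controls
sel_y = np.flatnonzero(Lasso(alpha=0.1).fit(X, Y).coef_)

# Step 2 (first stage): LASSO of D on X, keep the selected controls
sel_d = np.flatnonzero(Lasso(alpha=0.1).fit(X, D).coef_)

# Step 3 (estimation): OLS of Y on D and the union of the selected controls
selected = sorted(set(sel_y) | set(sel_d))
W = sm.add_constant(np.column_stack([D, X[:, selected]]))
print(sm.OLS(Y, W).fit().params[1])   # coefficient on D, should be close to a = 1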
Practice
Go back to our example and try the post-double selection procedure:
- First Stage selection: Regress `ads` on `past_sales`. Check if `past_sales` is statistically significant. (We don't have enough variables for LASSO to make sense, so just use OLS.)
- Reduced Form selection: Regress `sales` on `past_sales`. Check if `past_sales` is statistically significant.
- Estimation: Regress `sales` on `ads` and include `past_sales` only if it was significant in at least one of the two previous regressions.
Visualizing the algorithm
In the cell below I've updated the `pre_testing` function defined above to also compute the post-double selection estimator using the outlined algorithm. Let's try it out.
def pre_test2(d='ads', y='sales', x='past_sales', K=1000, **kwargs):

    # Init
    alphas = pd.DataFrame({'Complex': np.zeros(K),
                           'Simple': np.zeros(K),
                           'Pre-test': np.zeros(K),
                           'Post-double': np.zeros(K)})

    # Loop over simulations
    for k in range(K):

        # Generate data
        df = generate_data(seed=k, **kwargs)

        # Compute coefficients
        alphas.loc[k, 'Complex'] = smf.ols(f'{y} ~ {d} + {x}', df).fit().params[d]
        alphas.loc[k, 'Simple'] = smf.ols(f'{y} ~ {d}', df).fit().params[d]

        # Compute significance of the control in each auxiliary regression
        p_value_ydx = smf.ols(f'{y} ~ {d} + {x}', df).fit().pvalues[x]
        p_value_yx = smf.ols(f'{y} ~ {x}', df).fit().pvalues[x]
        p_value_dx = smf.ols(f'{d} ~ {x}', df).fit().pvalues[x]

        # Pre-test: keep the control only if it is significant in the regression of y on d and x
        if p_value_ydx < 0.05:
            alphas.loc[k, 'Pre-test'] = alphas.loc[k, 'Complex']
        else:
            alphas.loc[k, 'Pre-test'] = alphas.loc[k, 'Simple']

        # Post-double: keep the control if it is significant in the regression of y on x OR of d on x
        if p_value_yx < 0.05 or p_value_dx < 0.05:
            alphas.loc[k, 'Post-double'] = alphas.loc[k, 'Complex']
        else:
            alphas.loc[k, 'Post-double'] = alphas.loc[k, 'Simple']

    return alphas
alphas = pre_test2()
plot_alphas(alphas, true_alpha=1)
Practice
Earlier we saw that pre-testing was a problem when the sample size is small, when $\beta$ is of intermediate importance, and/or when the value of $\beta$ shrinks with the sample size.
- Use the above code to confirm that post-double selection removes regularization bias when the sample size is small.
def plot_alphas2(alphas, true_alpha):

    # Init plot
    fig, ax = plt.subplots(1, len(alphas), figsize=(5*len(alphas), 5), sharey=True, sharex=True)

    # Make one plot for each set of coefficients
    for i, key in enumerate(alphas.keys()):
        ax[i].hist(alphas[key]['Pre-test'], bins=30, lw=.1, alpha=0.5)
        ax[i].hist(alphas[key]['Post-double'], bins=30, lw=.1, alpha=0.5, color='C2')
        ax[i].set_title(key, size=18)
        ax[i].axvline(true_alpha, c='r', ls='--')
        ax[i].legend([rf'$\alpha=${true_alpha}', 'Pre-test', 'Post-double'],
                     prop={'size': 10}, loc='upper right', frameon=False)
        sns.despine(ax=ax[i])
Ngrid = [100,300,1000,3000]
alphas = {f'N = {n:.0f}': pre_test2(N=n) for n in Ngrid}
plot_alphas2(alphas, true_alpha=1)
- Use the above code to confirm that post-double selection removes regularization bias when the value of $\beta$ is "important" (ie, vary $\beta$).
betas = 0.3 * np.array([0.1,0.3,1,3])
alphas = {rf'$\beta$ = {b:.2f}': pre_test2(b=b) for b in betas}
plot_alphas2(alphas, true_alpha=1)
- Use the above code to confirm that post-double selection removes regularization bias even when $\beta$ varies with sample size.
betas = 0.3 * 30 / np.sqrt(Ngrid)
alphas = {rf'N = {n:.0f}, $\beta$ = {b:.2f}': pre_test2(b=b, N=n) for n,b in zip(Ngrid,betas)}
plot_alphas2(alphas, true_alpha=1)
3. Double Debiased Machine Learning (top)
So far, we have only analyzed a linear, univariate example. What happens if the dimension of X increases and we do not know the functional form through which X affects Y and D? In these cases, we can use machine learning algorithms to uncover these high-dimensional, non-linear relationships. Suppose we have the partially linear model:
$$
\begin{align}
Y &= \alpha D + g(X) + u\\
D &= m(X) + v
\end{align}
$$

where Y is the outcome variable, D is the treatment of interest, and X is a potentially high-dimensional set of control variables. The difference with respect to the previous setting is that now we leave the relationships between X and Y and between X and D unspecified, through the functions $g()$ and $m()$.
A naive approach to the estimation of $\alpha$ using machine learning methods would be, for example, to construct a sophisticated machine learning estimator for learning the regression function $\alpha D + g(X)$:
- Split the sample in two (a train-test split).
- Use the training data to estimate $\hat g(X)$.
- Use the testing data to compute the orthogonalized component of Y on X, $\hat u = Y - \hat g(X)$.
- Use the testing data to estimate the residualized OLS estimator from regressing $\hat u$ on $D$:

$$
\hat \alpha = (D'D)^{-1} D' \hat u
$$
This estimator will be biased because we are employing high-dimensional regularized estimators (eg, we are doing variable selection) and we risk overfitting bias. Overfitting shows up whenever the sample used to select the variables is the same one used to estimate the coefficient of interest. How do we address this? A train-test split for both the selection and the estimation procedures.
Double-debiased ML was developed in
V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, J. Robins, Double/debiased machine learning for treatment and structural parameters (2018), The Econometrics Journal.
and solves the problem using the same idea as the post-double selection recipe outlined above: reduce the regularization bias by performing the selection step twice, once for the outcome and once for the treatment. The estimator is still valid because of the Frisch-Waugh-Lovell theorem, which hopefully you remember from your econometrics course.
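As a quick refresher on the theorem, here is a small check on our simulated data. This block is my own, not part of the original notebook: the coefficient on ads from the full regression should equal the coefficient from regressing residualized sales on residualized ads.

import statsmodels.api as sm

df = generate_data()

# Coefficient on ads from the full regression
alpha_direct = smf.ols('sales ~ ads + past_sales', df).fit().params['ads']

# FWL: residualize the outcome and the treatment on the control, then regress residual on residual
u_hat = smf.ols('sales ~ past_sales', df).fit().resid
v_hat = smf.ols('ads ~ past_sales', df).fit().resid
alpha_fwl = sm.OLS(u_hat, v_hat).fit().params.iloc[0]

print(alpha_direct, alpha_fwl)   # the two numbers should match (up to floating point)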
Below is the recipe for data variables Y, D, and X which are connected via the following model:
$$
\begin{align}
Y &= \alpha D + g(X) + u\\
D &= m(X) + v
\end{align}
$$
- Split the sample in two: a main sample and an auxiliary sample.
- Use the auxiliary sample to estimate $\hat g(X)$ from
$$ Y = \alpha D + g(X) + u $$
where we remember that $g(X)$ could capture a very complicated and nonlinear relationship between Y and X (eg, a Random Forest).
- Use the auxiliary sample to estimate $\hat m(X)$ from
$$ D = m(X) + v $$
where we again want to allow for (but not impose) the mapping from X to D to be very complicated.
- Use the main sample to compute the orthogonalized component of D on X as
$$ \hat v = D - \hat m(X) $$
Note that we've switched samples here. This is like fitting $m(X)$ on the training data and then using the testing data to generate predictions and residuals.
- Use the main sample to estimate the double-residualized OLS estimator as
$$ \hat \alpha = (\hat v' \hat v)^{-1} \hat v' \hat u, \qquad \text{where} \;\; \hat u = Y - \hat g(X) $$
that is, regress the residualized outcome $\hat u$ on the residualized treatment $\hat v$.
This procedure is valid because, thanks to the sample splitting, the estimates $\hat g$ and $\hat m$ are computed on data that is independent of the main sample used in the final regression, which removes the overfitting bias.
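Here is a minimal sketch of the recipe on our simulated data, assuming scikit-learn's `RandomForestRegressor` and `train_test_split` are available. Everything below (the choice of learner, the 50/50 split, fitting $\hat g$ by simply predicting sales from past_sales, and the final OLS call) is my own illustration of the steps above, not the canonical implementation.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

df = generate_data()

# Split the sample in two: an auxiliary sample and a main sample
aux, main = train_test_split(df, test_size=0.5, random_state=1)
X_aux, X_main = aux[['past_sales']], main[['past_sales']]

# Auxiliary sample: estimate g_hat(X) for the outcome and m_hat(X) for the treatment
g_hat = RandomForestRegressor(random_state=1).fit(X_aux, aux['sales'])
m_hat = RandomForestRegressor(random_state=1).fit(X_aux, aux['ads'])

# Main sample: residualize the outcome and the treatment
u_hat = main['sales'] - g_hat.predict(X_main)
v_hat = main['ads'] - m_hat.predict(X_main)

# Main sample: regress the outcome residual on the treatment residual
res = sm.OLS(u_hat, sm.add_constant(v_hat)).fit()
print(res.params['ads'])   # should be close to the true alpha = 1

In practice you would likely reach for a dedicated implementation (eg, the DoubleML or econml packages), which also handle cross-fitting (swapping the roles of the two samples and averaging) and proper standard errors.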
Discussion
The double-debiased machine learning model implicitly assumes that the control variables X are (weakly) common causes of both the outcome Y and the treatment D. If this is the case, and no further mediated/indirect relationship exists between X and Y, there is no problem. However, if, for example, some variable among the controls X is a common effect instead of a common cause, its inclusion will bias the coefficient of interest. Moreover, such a variable is likely to be highly correlated with either the outcome Y or the treatment D. In the latter case, post-double selection might include it in situations where simple selection would not have. Therefore, in the presence of bad control variables, double-debiased machine learning might be even worse than simple pre-testing. It is therefore crucial to have a clear understanding of how and why the variables you include in the procedure belong in the model. As with anything in ML, the toolbox is not a silver bullet and requires domain knowledge as an input to ensure the output makes sense.