Lecture 7: Causality via 2SLS, Difference-in-Differences, and Synthetic Controls [FINISHED]
Synthetic Control methods are the "most important development in program evaluation in the last decade."
-- Susan Athey and Guido Imbens (2016) "The State of Applied Econometrics - Causality and Policy Evaluation."
We know that correlation is not causation, but it can be tricky to separate the two. Causation measures the real impact of X on Y; e.g., what is the effect of an ad campaign on the sales of a product?
It is critical to precisely understand the causal effects of these interventions on the subject. One of the main threats to causal inference is the confounding effect from other variables. In the case of ad campaigns, it could be a reduction in price of the product, change in the overall economy or various other factors that could be inducing the change in sales at the same time. So how do we correctly attribute the change in sales because of the ad campaign?
There are two ways to estimate the true causal impact of the intervention on the subject.
- Randomized Experiment: The researcher designs an experiment such that participants are assigned the treatment at random. Since the treatment is randomized, the participants who receive the treatment are observationally identical to those who do not (e.g., vaccine trials). The researcher follows the two groups to evaluate whether the treatment affects outcomes of interest. This approach is the most reliable method to infer the actual causal impact of the treatment, but a lot of interesting and important questions cannot be addressed via experimentation for a variety of logistical and ethical reasons.
- Causal Inference: The alternative is to apply statistical procedures to existing data to estimate causal effects independent of confounding factors. This will be our focus. Note that instrumental variables (IV) regression falls in this category.
Specifically, we'll discuss the following in this lecture:
Class Announcements
PS2 is due by end of day today (Tuesday) to eLC. Extra office hours Tuesday afternoon until 4 pm.
1. Estimating a Demand Curve (top)
During the 1880s, a cartel known as the Joint Executive Committee (JEC) controlled the rail transport of grain from the Midwest to eastern cities in the United States. The cartel preceded the Sherman Antitrust Act of 1890, and it legally operated to increase the price of grain above what would have been the competitive price. From time to time, cheating by members of the cartel brought about a temporary collapse of the collusive price-setting agreement.
In this exercise, you will use variations in supply associated with the cartel's collapses to estimate the elasticity of demand for rail transport of grain.
The file 'JEC.xls' contains the data required for this exercise. It contains weekly observations on the rail shipping price and other factors from 1880 to 1886. See 'JEC_data_description.pdf' for a detailed description of the variables.
Suppose that the demand curve equation for rail transport of grain is specified as
$$\ln(Q_i) = \beta_0 + \beta_1 \ln(P_i) + \beta_2 Ice_i + \sum_{j=1}^{12} \beta_{2+j} Seas_{j,i} + u_i,$$where $Q_i$ is the total tonnage of grain shipped in week $i$, $P_i$ is the price of shipping one ton (2,000 pounds) of grain by rail, $Ice_i$ is a binary variable that is equal to 1 if the Great Lakes are not navigable because of ice, and $Seas_{j,i}$ is a binary variable that captures seasonal variation in demand (nb, there are 13 potential "months" where 13 corresponds to 12/4-12/31). $Ice$ is included because grain could also be transported by ship when the Great Lakes were navigable.
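Because the demand equation is specified in logs, the coefficient on $\ln(P_i)$ is directly the price elasticity of demand:
$$\beta_1 = \frac{\partial \ln(Q_i)}{\partial \ln(P_i)} = \frac{\partial Q_i / Q_i}{\partial P_i / P_i} \approx \frac{\%\Delta Q}{\%\Delta P},$$
so a 1% increase in the shipping price changes the quantity of grain shipped by approximately $\beta_1$ percent.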
1.1 Correlation
We load the data as a Pandas dataframe, estimate the demand curve equation by OLS using the HC3 estimator for the standard errors, and print the summary table of the results.
We're interested in the estimated value of the price elasticity of demand.
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
jec = pd.read_excel('./Data/JEC.xls')
jec.head(10)
week | price | cartel | quantity | seas1 | seas2 | seas3 | seas4 | seas5 | seas6 | seas7 | seas8 | seas9 | seas10 | seas11 | seas12 | ice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.40 | 1 | 13632 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2 | 0.40 | 1 | 20035 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 3 | 0.40 | 1 | 16319 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 4 | 0.40 | 1 | 12603 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 5 | 0.40 | 1 | 23079 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 | 6 | 0.40 | 1 | 19652 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
6 | 7 | 0.40 | 1 | 16211 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
7 | 8 | 0.40 | 1 | 22914 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
8 | 9 | 0.40 | 1 | 23710 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
9 | 10 | 0.35 | 1 | 23036 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# We need to build the string that specifies the regression. Here are two ways
# Tedious way (remember what triple quotes do?):
# model_str = '''np.log(quantity) ~ np.log(price) + ice + seas1 + seas2 + seas3 + seas4 +
# seas5 + seas6 + seas7 + seas8 + seas9 + seas10 + seas11 + seas12'''
# Let Python do the work!
varlist = ['seas' + str(x) for x in range(1,13)]
seasons = ' + '.join(varlist) # The .join() method concatenates all the elements
# in varlist and puts a plus sign between them
model_str = 'np.log(quantity) ~ np.log(price) + ice +' + seasons
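As a quick sanity check (just an illustrative step), we can print the assembled formula string before handing it to statsmodels; it should contain np.log(price), ice, and all twelve seasonal dummies joined by plus signs.
# Sanity check: inspect the regression formula we just built programmatically
print(model_str)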
OLS Estimation
ols_res = smf.ols(model_str, data=jec).fit(cov_type = 'HC3')
print(ols_res.summary())
print('\n\nThe estimated price elasticity of demand is {0:2.3f}, and its standard error is {1:2.3f}.'
.format(ols_res.params['np.log(price)'], ols_res.HC3_se['np.log(price)']))
                            OLS Regression Results
==============================================================================
Dep. Variable:       np.log(quantity)   R-squared:                       0.313
Model:                            OLS   Adj. R-squared:                  0.282
Method:                 Least Squares   F-statistic:                     11.14
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           1.14e-20
Time:                        15:10:23   Log-Likelihood:                -154.95
No. Observations:                 328   AIC:                             339.9
Df Residuals:                     313   BIC:                             396.8
Df Model:                          14
Covariance Type:                  HC3
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         8.8612      0.189     46.931      0.000       8.491       9.231
np.log(price)    -0.6389      0.075     -8.473      0.000      -0.787      -0.491
ice               0.4478      0.145      3.081      0.002       0.163       0.733
seas1            -0.1328      0.099     -1.345      0.179      -0.326       0.061
seas2             0.0669      0.094      0.715      0.474      -0.116       0.250
seas3             0.1114      0.100      1.116      0.265      -0.084       0.307
seas4             0.1554      0.137      1.137      0.256      -0.113       0.423
seas5             0.1097      0.137      0.801      0.423      -0.159       0.378
seas6             0.0468      0.189      0.248      0.804      -0.324       0.417
seas7             0.1226      0.212      0.579      0.563      -0.293       0.538
seas8            -0.2350      0.187     -1.254      0.210      -0.602       0.132
seas9             0.0036      0.185      0.019      0.985      -0.359       0.366
seas10            0.1692      0.186      0.912      0.362      -0.194       0.533
seas11            0.2152      0.185      1.161      0.246      -0.148       0.578
seas12            0.2196      0.183      1.203      0.229      -0.138       0.577
==============================================================================
Omnibus:                       14.928   Durbin-Watson:                   0.528
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               15.666
Skew:                          -0.505   Prob(JB):                     0.000396
Kurtosis:                       3.353   Cond. No.                         35.6
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)

The estimated price elasticity of demand is -0.639, and its standard error is 0.075.
Practice
- Explain why the interaction of supply and demand will make our OLS estimates biased. Can you sign the bias?
Answer:
Prices and quantities are determined by the intersection of supply and demand in market equilibrium. Due to this simultaneity problem, we cannot simply run OLS to estimate the demand curve. If we do run OLS, it is unclear which curve we are actually estimating -- in some sense we are not (consistently) estimating either supply or demand!
Alternatively, you could think of $u_i$ as an unobserved factor that shifts demand. If $u_i$ is positive, then demand is shifted to the right, which would increase both price and quantity when supply is held fixed. Thus $Cov(\ln(P_i), u_i) > 0$ and the OLS coefficient is biased upwards.
If you would like to know more, this note by Douglas Steigerwald provides a more detailed explanation.
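To make the simultaneity bias concrete, here is a small simulation (a sketch with made-up parameter values, reusing the packages imported above). Price and quantity are generated jointly from a demand curve with true elasticity -1.0 and a supply curve shifted by an observed cost variable z; OLS on the demand equation is biased toward zero, while using z as an instrument recovers the true elasticity.
# A sketch: demand is lnq = -1.0*lnp + u, supply is lnq = 0.5*lnp - z + v.
# Setting demand equal to supply gives the equilibrium price below.
rng = np.random.default_rng(0)
n = 5_000
u = rng.normal(size=n)                       # unobserved demand shock
v = rng.normal(size=n)                       # unobserved supply shock
z = rng.normal(size=n)                       # observed supply (cost) shifter -- the instrument
lnp = (z + u - v) / 1.5                      # equilibrium log price
lnq = -1.0 * lnp + u                         # equilibrium log quantity (from the demand curve)
sim = pd.DataFrame({'lnq': lnq, 'lnp': lnp, 'z': z})
ols_sim = smf.ols('lnq ~ lnp', data=sim).fit()
iv_slope = np.cov(z, lnq)[0, 1] / np.cov(z, lnp)[0, 1]   # simple IV estimate (ratio of covariances)
print(f"OLS estimate: {ols_sim.params['lnp']:.2f}  (biased toward zero)")
print(f"IV estimate:  {iv_slope:.2f}  (close to the true elasticity of -1.0)")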
1.2 Causality
We separate causation from correlation by introducing an instrument. Intuitively, we need something that shifts supply but not demand. If we have a variable that exogenously shifts the supply curve, we can use these shifts to trace out the demand curve. An exogenous change in firm costs (due to changes in input prices, taxes, regulations) is a great instrument because it elicits a change in the firm's pricing decision.
A subtle and interesting point is that all products are connected in the (pricing) equilibrium:
- a cost change for one product of a multiproduct firm effectively changes prices for all products sold by the firm.
- a cost change for one firm leads to other firms changing their prices.
Back to the JEC case
We could use $cartel_i$ (see JEC_data_description.pdf for this variable's definition) as an instrumental variable for $\ln(P_i)$.
Does this (plausibly) satisfy the two conditions for a valid instrument?
For $cartel_i$ to be a valid instrument for $\ln(P_i)$, it must satisfy the following two conditions:
- Relevance: $Cov(cartel_i,\ln(P_i)) \neq 0$
- Exogeneity: $Cov(cartel_i, u_i) = 0$
The first condition is satisfied: the cartel acts to restrict quantity and increase prices, so $Cov(cartel_i,\ln(P_i)) > 0$.
The second condition requires that the operation of the cartel was unrelated to other unobserved factors that shift the demand for rail transport of grain. We know that the cartel faced an information asymmetry problem among its members and would periodically (and essentially randomly) penalize itself with collapses of the collusive agreement, so it seems plausible that $Cov(cartel_i, u_i) = 0$.
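We can at least probe the relevance condition directly by running the first-stage regression of $\ln(P_i)$ on the instrument and the exogenous regressors (a sketch using the objects defined above). With a single instrument, the first-stage F-statistic is roughly the squared t-statistic on $cartel_i$, and a common rule of thumb is that it should be well above 10. Exogeneity, by contrast, cannot be tested here because the model is just identified; it has to be argued, as above.
# First-stage check: does the cartel indicator move the (log) shipping price?
first_stage = smf.ols('np.log(price) ~ cartel + ice + ' + seasons, data=jec).fit(cov_type='HC3')
t_cartel = first_stage.tvalues['cartel']
print(f"First-stage coefficient on cartel: {first_stage.params['cartel']:.3f}")
print(f"Robust t-statistic: {t_cartel:.2f} (so the first-stage F is roughly {t_cartel**2:.1f})")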
2SLS Estimation
We'll estimate the demand equation using two stage least squares with $cartel_i$ as an instrument for $\ln(P_i)$. We're interested in the demand price elasticity which amounts to the estimated coefficient on $\ln(P)$.
NOTE: You might need to install the linearmodels
package, i.e.,
pip install linearmodels
import linearmodels.iv as iv
# The seasons variable is the same as in the first regression.
model_str_iv = 'np.log(quantity) ~ 1 + [np.log(price) ~ cartel] + ice + ' + seasons
iv_res = iv.IV2SLS.from_formula(model_str_iv, jec).fit()
print(iv_res.summary)
print('\n\nThe estimated price elasticity of demand is {0:2.3f}, and its standard error is {1:2.3f}.'
.format(iv_res.params['np.log(price)'], iv_res.std_errors['np.log(price)']))
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:       np.log(quantity)   R-squared:                      0.2959
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2644
No. Observations:                 328   F-statistic:                    165.29
Date:                Tue, Nov 01 2022   P-value (F-stat)                0.0000
Time:                        15:10:28   Distribution:                 chi2(14)
Cov. Estimator:                robust

                              Parameter Estimates
=================================================================================
               Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
---------------------------------------------------------------------------------
Intercept         8.5735     0.2106     40.701     0.0000      8.1607      8.9864
ice               0.4229     0.1315     3.2160     0.0013      0.1652      0.6807
seas1            -0.1310     0.1005    -1.3027     0.1927     -0.3280      0.0661
seas10            0.0858     0.1738     0.4937     0.6215     -0.2549      0.4265
seas11            0.1518     0.1716     0.8845     0.3764     -0.1846      0.4882
seas12            0.1787     0.1669     1.0707     0.2843     -0.1484      0.5057
seas2             0.0910     0.0927     0.9807     0.3267     -0.0908      0.2727
seas3             0.1359     0.0981     1.3852     0.1660     -0.0564      0.3281
seas4             0.1525     0.1314     1.1610     0.2456     -0.1050      0.4100
seas5             0.0736     0.1271     0.5786     0.5629     -0.1756      0.3227
seas6            -0.0061     0.1722    -0.0352     0.9719     -0.3435      0.3314
seas7             0.0602     0.1964     0.3066     0.7591     -0.3247      0.4452
seas8            -0.2936     0.1708    -1.7194     0.0855     -0.6283      0.0411
seas9            -0.0584     0.1714    -0.3405     0.7334     -0.3943      0.2776
np.log(price)    -0.8666     0.1307    -6.6285     0.0000     -1.1228     -0.6103
=================================================================================

Endogenous: np.log(price)
Instruments: cartel
Robust Covariance (Heteroskedastic)
Debiased: False

The estimated price elasticity of demand is -0.867, and its standard error is 0.131.
**Here is a subtle but important point -- a good price instrument makes the estimated demand curve more elastic. The simultaneity bias pushes the OLS estimate toward zero, so removing it with a valid instrument yields a larger (in magnitude) elasticity. We know this from economic theory.**
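As an aside, here is what 2SLS is doing under the hood (a minimal sketch; lnp_hat is an illustrative column name). The point estimates from this manual two-step procedure match the IV2SLS output, but its second-stage standard errors are not valid, which is one reason to use the dedicated package.
# Manual two-stage least squares (point estimates only; the standard errors here are wrong)
stage1 = smf.ols('np.log(price) ~ cartel + ice + ' + seasons, data=jec).fit()    # first stage
jec['lnp_hat'] = stage1.fittedvalues                                             # exogenous part of log price
stage2 = smf.ols('np.log(quantity) ~ lnp_hat + ice + ' + seasons, data=jec).fit()
print(f"Manual 2SLS elasticity estimate: {stage2.params['lnp_hat']:.3f}")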
2. Terrorist Bombings in the Basque Country (top)
We're going to estimate the economic impact of terrorism in the Basque region -- an autonomous community in Spain. To do this, we'll use information from the 16 other Spanish regions, where our underlying assumption is that terrorism affected economic activity in the Basque region but not elsewhere, whereas other economic shocks are aggregate and affect all of Spain.
Here are the details:
- Coverage from 1955–1997 for 18 Spanish regions. One of the data "regions" is all of Spain which we won't use.
- The treatment region is “Basque Country (Pais Vasco)”.
- The "treatment" year is 1975 since there were several bombings around that year.
- We will measure "economic impact" via GDP per capita (in thousands).
Background: Euskadi Ta Askatasuna (ETA), was an armed leftist Basque nationalist and separatist organization in the Basque Country (in northern Spain and southwestern France). The group was founded in 1959 and later evolved from a group promoting traditional Basque culture to a paramilitary group engaged in a violent campaign of bombing, assassinations and kidnappings in the Southern Basque Country and throughout Spanish territory. Its goal was gaining independence for the Basque Country. ETA was the main group within the Basque National Liberation Movement and was the most important Basque participant in the Basque conflict. While its terrorist activities spanned several decades, the death of Spanish dictator Francisco Franco in 1975 led to a substantial increase in bombings. (Wikipedia)
Loading Data from R
We've learned how to load data from csv, excel, and stata. Another popular programming language (and to some degree a rival of python) is the open-source language R. The data we want to access is from the following academic paper
Abadie, A. and Gardeazabal, J. (2003) "Economic Costs of Conflict: A Case Study of the Basque Country." American Economic Review 93 (1) 113--132.
where the data was saved in the R programming language. We'll do this by leveraging python's own open-source nature and load a user-created package called pyreadr
which we install as usual in the terminal:
pip install pyreadr
We then use pyreadr
to load basque.RData
.
# Load necessary packages
import numpy as np
import pandas as pd
import datetime as dt # for time and date
import pyreadr
result = pyreadr.read_r('./Data/basque.RData')
# done! The result is a dictionary where keys are the name of objects
print(result.keys())
odict_keys(['basque'])
data = result['basque'] # extract the pandas dataframe
# Let's take a look at the data
print('Variable Types:')
print(data.dtypes)
print('\n Variable Statistics:')
data.describe()
Variable Types:
regionno                 float64
regionname                object
year                     float64
gdpcap                   float64
sec.agriculture          float64
sec.energy               float64
sec.industry             float64
sec.construction         float64
sec.services.venta       float64
sec.services.nonventa    float64
school.illit             float64
school.prim              float64
school.med               float64
school.high              float64
school.post.high         float64
popdens                  float64
invest                   float64
dtype: object

Variable Statistics:
regionno | year | gdpcap | sec.agriculture | sec.energy | sec.industry | sec.construction | sec.services.venta | sec.services.nonventa | school.illit | school.prim | school.med | school.high | school.post.high | popdens | invest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 774.000000 | 774.000000 | 774.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 108.000000 | 108.000000 | 108.000000 | 108.000000 | 108.000000 | 18.000000 | 576.000000 |
mean | 9.500000 | 1976.000000 | 5.394987 | 20.268445 | 5.188556 | 23.915333 | 7.211556 | 36.485222 | 6.934556 | 308.051261 | 2118.524959 | 145.613775 | 45.943488 | 25.458404 | 105.769444 | 21.395883 |
std | 5.191482 | 12.417698 | 2.242637 | 10.376150 | 4.035458 | 9.281848 | 1.361570 | 7.261439 | 1.978371 | 630.842242 | 4216.779749 | 297.452345 | 92.106997 | 51.579575 | 101.518005 | 4.111414 |
min | 1.000000 | 1955.000000 | 1.243430 | 1.320000 | 1.600000 | 9.560000 | 4.340000 | 26.230000 | 3.430000 | 8.097660 | 151.320892 | 8.609827 | 3.063398 | 1.660274 | 22.379999 | 9.331671 |
25% | 5.000000 | 1965.000000 | 3.693034 | 13.542500 | 2.697500 | 17.809999 | 6.240000 | 31.247500 | 5.497500 | 40.290497 | 432.293228 | 26.513481 | 9.131781 | 4.407411 | 44.772499 | 18.742375 |
50% | 9.500000 | 1976.000000 | 5.335690 | 19.240000 | 3.675000 | 23.135000 | 7.135000 | 34.750000 | 6.680000 | 116.232128 | 852.125793 | 47.752241 | 16.696211 | 7.712549 | 80.375000 | 21.350711 |
75% | 14.000000 | 1987.000000 | 6.869091 | 27.482500 | 6.080000 | 27.480000 | 8.177500 | 39.192500 | 7.927500 | 252.270393 | 1763.311646 | 119.039047 | 38.758285 | 19.058688 | 122.567497 | 23.750510 |
max | 18.000000 | 1997.000000 | 12.350043 | 46.500000 | 21.360001 | 46.220001 | 11.280000 | 58.209999 | 13.110000 | 2863.278320 | 19459.558594 | 1696.146851 | 474.941162 | 252.250000 | 442.450012 | 39.409801 |
Before proceeding to the econometrics, it's useful to graph the data variation we're most interested in. In this project, we're evaluating the effect of terrorism on GDP per capita, using the bombings in 1975 as a "natural experiment." We'll do so using Matplotlib.
terror = data[['regionno','regionname','year','gdpcap']].copy() # .copy() so later edits don't trigger a SettingWithCopyWarning
terror.loc[:,'year'] = pd.to_datetime(terror['year'].copy(), format='%Y')
terror.drop(terror[terror['regionname'] == 'Spain (Espana)'].index, inplace = True) # Drop Spain (Aggregate) Observations
terror.set_index('year',inplace=True)
print('\n Region Indicators:')
print(terror.groupby('regionname')['regionno'].agg(['mean']))
Region Indicators:
                              mean
regionname
Andalucia                      2.0
Aragon                         3.0
Baleares (Islas)               5.0
Basque Country (Pais Vasco)   17.0
Canarias                       6.0
Cantabria                      7.0
Castilla Y Leon                8.0
Castilla-La Mancha             9.0
Cataluna                      10.0
Comunidad Valenciana          11.0
Extremadura                   12.0
Galicia                       13.0
Madrid (Comunidad De)         14.0
Murcia (Region de)            15.0
Navarra (Comunidad Foral De)  16.0
Principado De Asturias         4.0
Rioja (La)                    18.0
import matplotlib.pyplot as plt
import seaborn as sns
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.axvline(x=dt.datetime(1975,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1970,1,1),9),va='center',ha='center',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1980,1,1),9),va='center',ha='center',size=18)
ax.plot(terror[terror['regionno'] == 17].index, terror[terror['regionno'] == 17]['gdpcap'],color='blue')
ax.set_ylim(3,10)
ax.set_yticks(np.arange(3, 10.01, step=1.0))
ax.set_ylabel('Per Capita GDP',size=14)
ax.set_xlabel(' ')
ax.set_title('Per Capita GDP Over Time (Basque Region)',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
We observe that per capita GDP decreases after the 1975 increase in ETA terrorist bombings and then increases. How much of that decrease is due to terrorism and how much is due to general economic uncertainty due to Franco's death?
3. First-Difference (top)
We'll begin by comparing per capita GDP before and after the 1975 treatment year. One way to do this is to construct a "first difference" equation with per capita GDP as the dependent variable and a pre-post indicator as the independent variable,
$$ gdpcap_t = \beta_0 + \beta_1 \, 1_{\{t>1975\}} + u_t, $$
where $1_{\{t>1975\}}$ is equal to one for all years after 1975 and zero otherwise.
import statsmodels.api as sm # provides statistical models like ols, gmm, anova, etc...
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
# Create treatment variable
terror['treat'] = 0
terror.loc[(terror.index>dt.datetime(1975, 1, 1)),'treat'] = 1
FD = smf.ols('gdpcap ~ treat', data=terror[terror['regionno'] == 17]).fit()
print(FD.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 gdpcap   R-squared:                       0.549
Model:                            OLS   Adj. R-squared:                  0.538
Method:                 Least Squares   F-statistic:                     49.92
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           1.33e-08
Time:                        15:10:34   Log-Likelihood:                -66.097
No. Observations:                  43   AIC:                             136.2
Df Residuals:                      41   BIC:                             139.7
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3823      0.252     21.399      0.000       4.874       5.890
treat          2.4844      0.352      7.065      0.000       1.774       3.195
==============================================================================
Omnibus:                       10.194   Durbin-Watson:                   0.171
Prob(Omnibus):                  0.006   Jarque-Bera (JB):                3.091
Skew:                           0.259   Prob(JB):                        0.213
Kurtosis:                       1.793   Cond. No.                         2.65
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The regression indicates that ETA terrorism increased GDP per capita by $\approx$2.5 units. That seems odd. What's happening?
Compare the graph and the regression. The 2.5 unit increase reflects trend growth, but GDP could be growing for a lot of reasons, and what we're really interested in is how ETA terrorism may have inhibited that growth. What we need to do is separate the effect of the bombings from all of the things that could be confounding the results.
BTW, I hope some part of you is thinking "This regression is dumb because all it's really doing is comparing average per capita GDP before 1975 to average per capita GDP afterwards." To see this, let's simply compute these averages and compare:
print(terror[terror['regionno'] == 17].groupby('treat')['gdpcap'].agg(['mean']))
           mean
treat
0      5.382258
1      7.866707
Per capita GDP after treatment is 7.87 while before treatment is 5.38 with a difference of -- not surprisingly -- 2.48!
Let's see what happened graphically:
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.axvline(x=dt.datetime(1975,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1970,1,1),9),va='center',ha='center',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1980,1,1),9),va='center',ha='center',size=18)
avgpre = np.mean(terror[(terror['regionno'] == 17)&(terror['treat'] == 0)]['gdpcap'])
avgpost = np.mean(terror[(terror['regionno'] == 17)&(terror['treat'] == 1)]['gdpcap'])
ax.axhline(y=avgpre, xmin=0, xmax=1, color='red', linewidth=2.0, dashes=[5,5])
ax.axhline(y=avgpost, xmin=0, xmax=1, color='red', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period Avg = '+ str(round(avgpre,2)),xy=(dt.datetime(1954,1,1),avgpre),va='bottom',ha='left',size=16)
ax.annotate('Post-Period Avg = '+ str(round(avgpost,2)),xy=(dt.datetime(1954,1,1),avgpost),va='bottom',ha='left',size=16)
ax.plot(terror[terror['regionno'] == 17].index, terror[terror['regionno'] == 17]['gdpcap'],color='blue')
ax.set_ylim(3,10)
ax.set_yticks(np.arange(3, 10.01, step=1.0))
ax.set_ylabel('Per Capita GDP',size=14)
ax.set_xlabel(' ')
ax.set_title('Per Capita GDP Over Time (Basque Region)',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
4. Difference-in-Differences (top)
We'll establish causality by comparing the trend of a "control" group to the Basque region. This is like conducting an experiment where you give some patients a vaccine, drug, etc. while you give other patients a placebo. In that experimental setting, we design the experiment such that the control and treatment groups are identical but for the treatment applied. We then observe whether or not outcomes of interest are significantly different between the two groups.
The same rationale holds here, but of course we don't have an experimental setting, so we choose a control group after the fact under the assumption that this group provides an adequate proxy for the trend that would have been observed in the treatment group in the absence of treatment. If true, the difference in the change of slope is the actual treatment effect. Hence, the "difference in differences" terminology. Note that our "identification strategy" requires that the treatment and control groups follow the same pre-period (i.e., pre-treatment) trend, which is something we can easily show.
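For reference, the canonical difference-in-differences regression with a treated-unit indicator $Treat_i$ and a post-treatment indicator $Post_t$ is
$$ y_{it} = \beta_0 + \beta_1 Treat_i + \beta_2 Post_t + \beta_3 \left(Treat_i \times Post_t\right) + u_{it}, $$
where $\beta_3$ is the DiD estimate: the post-minus-pre change for the treated unit minus the post-minus-pre change for the control. The regressions we run below are a variant of this idea in which the control region's GDP enters directly as a regressor.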
In fact, we'll choose our control group using the following criterion: the region whose pre-period (pre-1975) per capita GDP is most highly correlated with the Basque Country's.
variation = terror.pivot(columns='regionname',values='gdpcap') # Use existing index
variation.drop(variation[variation['Basque Country (Pais Vasco)'] == 0].index, inplace = True)
variation.drop(variation[variation.index > dt.datetime(1975,1,1)].index, inplace = True)
corr = variation.corr()
interest = corr['Basque Country (Pais Vasco)'].copy()
interest.sort_values(ascending=False, inplace=True)
interest.drop(interest.index[0], inplace=True)
print("\nPre-Trends (in Descending Order)")
print(interest)
Pre-Trends (in Descending Order)
regionname
Aragon                          0.997043
Cataluna                        0.996830
Navarra (Comunidad Foral De)    0.996262
Rioja (La)                      0.995469
Principado De Asturias          0.995289
Cantabria                       0.995258
Castilla Y Leon                 0.992800
Murcia (Region de)              0.991812
Comunidad Valenciana            0.991300
Madrid (Comunidad De)           0.988504
Baleares (Islas)                0.987941
Andalucia                       0.987689
Galicia                         0.984275
Extremadura                     0.984099
Castilla-La Mancha              0.981486
Canarias                        0.975647
Name: Basque Country (Pais Vasco), dtype: float64
temp = terror.pivot(columns='regionname',values='gdpcap') # Use existing index
subset = temp[['Basque Country (Pais Vasco)','Aragon','Cataluna']].copy()
subset.rename(columns={'Basque Country (Pais Vasco)':'Basque Country'},inplace=True)
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.axvline(x=dt.datetime(1975,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1970,1,1),9),va='center',ha='center',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1980,1,1),9),va='center',ha='center',size=18)
ax.plot(subset.index, subset['Basque Country'],color='blue',label='Basque Country (Treated)')
ax.plot(subset.index, subset['Aragon'],color='red',label='Aragon (Control 1)')
ax.plot(subset.index, subset['Cataluna'],color='green',label='Cataluna (Control 2)')
ax.legend(frameon=False,fontsize=18)
ax.set_ylim(2,12)
ax.set_yticks(np.arange(3, 10.01, step=1.0))
ax.set_ylabel('Per Capita GDP',size=14)
ax.set_xlabel(' ')
ax.set_title('Per Capita GDP Over Time (Select Regions)',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Aragon follows the Basque Country not only pre-period but also post-period. Look at a map and you'll see that the Basque Country and Aragon are neighbors, so it seems reasonable that GDP effects related to ETA activities spilled over into Aragon; the latter is therefore not a great control. Cataluna, however, is farther away, so it better satisfies our "exclusion restriction." Let's redo the analysis using the DiD estimator as follows:
subset = temp[['Basque Country (Pais Vasco)','Cataluna']].copy()
subset.rename(columns={'Basque Country (Pais Vasco)':'Basque'},inplace=True)
# Create treatment variable
subset['Post'] = 0
subset.loc[(subset.index>dt.datetime(1975, 1, 1)),'Post'] = 1
FD = smf.ols('Basque ~ Post + Cataluna + Cataluna*Post', data=subset).fit()
print(FD.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Basque   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.990
Method:                 Least Squares   F-statistic:                     1378.
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           1.36e-39
Time:                        15:10:36   Log-Likelihood:                 17.242
No. Observations:                  43   AIC:                            -26.48
Df Residuals:                      39   BIC:                            -19.44
Df Model:                           3
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         0.2610      0.180      1.451      0.155      -0.103       0.625
Post              0.4438      0.292      1.521      0.136      -0.147       1.034
Cataluna          0.9767      0.034     29.096      0.000       0.909       1.045
Cataluna:Post    -0.1422      0.043     -3.328      0.002      -0.229      -0.056
==============================================================================
Omnibus:                       17.526   Durbin-Watson:                   0.387
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               23.446
Skew:                           1.284   Prob(JB):                     8.10e-06
Kurtosis:                       5.548   Cond. No.                         115.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Discussion: Using the Cataluna region as the control indicates that ETA-related terrorism reduced per capita GDP by about 0.14 units. Moreover, this decline is statistically significant.
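As a cross-check, here is a minimal sketch of the canonical DiD layout from the equation above, stacking the Basque Country and Cataluna in long form (the names did, basque, and post are just illustrative). Because this specification differs from the slope-interaction regression above, the interaction coefficient will not exactly match -0.14.
# Canonical DiD (a sketch): two regions, treated dummy, post dummy, and their interaction
did = terror[terror['regionname'].isin(['Basque Country (Pais Vasco)', 'Cataluna'])].copy()
did['basque'] = (did['regionname'] == 'Basque Country (Pais Vasco)').astype(int)
did['post'] = (did.index > dt.datetime(1975, 1, 1)).astype(int)
did_res = smf.ols('gdpcap ~ basque + post + basque:post', data=did).fit()
print(did_res.params['basque:post']) # the difference-in-differences estimate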
5. Synthetic Control (top)
Generally speaking (umm... writing), the "synthetic control" method allows for estimation in settings where a single unit (a state, country, firm, etc.) is exposed to an event or intervention but it's not obvious who or what the control group should be. Specifically, synthetic control provides a data-driven procedure to construct a control group using a convex combination of comparison units. The idea is that this generated (ie, synthetic) control group approximates the characteristics of the unit of interest prior to treatment. The thought (hope) is that a combination of comparison units provides a better comparison for the unit exposed to the intervention than any single comparison unit.
In our example above, we identified Cataluna as our control by generating a metric to assess pre-period trends. Fortunately, Cataluna turned out to be a nice fit -- how fortunate! In many (most?) cases, we will have too much data to identify a control group by "eyeballing" the data, and we'll likely not find a single unit that satisfies the parallel pre-period trend assumption. What if instead we could create a control group via a convex combination of potential candidates? That's the idea behind "synthetic controls" -- we create a control group.
Synthetic control is a technique very similar to DiD for estimating the true impact of a treatment. Both methods use control groups to construct a counterfactual for the treated group, giving us an idea of what the trend would have been had the treatment not happened. The counterfactual GDP of the treated group is predicted from the GDP of the control groups (and possibly other covariates in the control group).
Just as the DiD approach used the control (Cataluna) to construct a counterfactual for the treated group, "synthetic control" predicts the counterfactual by assigning weights to the control units and using those weights in prediction. Ultimately, the causal impact is the difference between actual GDP and the counterfactual GDP had the treatment not happened, which is the same idea as DiD.
As before, let's look at the raw data before proceeding:
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
for r in terror.regionname.unique():
if r=="Basque Country (Pais Vasco)":
ax.plot(terror[terror['regionname']==r].index, terror[terror['regionname']==r]['gdpcap'],
linewidth=5.0, dashes=[5,5],label=r)
else:
ax.plot(terror[terror['regionname']==r].index, terror[terror['regionname']==r]['gdpcap'],label=r)
ax.legend(frameon=False)
ax.axvline(x=dt.datetime(1975,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1970,1,1),11),va='center',ha='center',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1980,1,1),11),va='center',ha='center',size=18)
ax.set_ylim(0,12)
ax.set_yticks(np.arange(0, 12.01, step=1.0))
ax.set_ylabel('Per Capita GDP',size=14)
ax.set_xlabel(' ')
ax.set_title('Per Capita GDP Over Time (Basque Region)',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Idea: We'll pick the convex combination of regions that best matches the evolution of per capita GDP prior to the ETA bombings (i.e., before 1975). We'll then hold these weights fixed and generate per capita GDP for the synthetic control group post-treatment. That's it. Identifying the causal effect then follows our DiD regression, where we use the synthetic control rather than a specific control (e.g., Cataluna). A nice feature of this approach is that we can be hands-off in the selection of our control group. Plus, construction of our synthetic control is not based on the treatment period, so there's minimal risk of inadvertently baking in our results.
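Formally, with $J$ candidate control regions, the synthetic control for the treated unit is the weighted average
$$ \hat{y}^{S}_{t} = \sum_{j=1}^{J} w_j \, y_{jt}, $$
where, in the traditional formulation, the weights satisfy $w_j \ge 0$ and $\sum_{j=1}^{J} w_j = 1$ and are chosen so that $\hat{y}^{S}_{t}$ tracks the treated unit's outcome as closely as possible over the pre-treatment years.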
Caveat Emptor: The following code uses our knowledge of machine learning (LASSO) to generate the candidate weights. While machine learning algorithms like this have become increasingly popular for generating the weights of a synthetic control group, the traditional approach is to estimate weights which sum to one and are non-negative. LASSO imposes neither. If you want to get into the weeds, the conditions of non-negative weights which sum to one are important to guarantee that the weights used to construct the synthetic control group are unique (i.e., there is only one way to fit the pre-sample, so there is also only one way to construct the synthetic control after treatment). That said, I like my approach as it instills greater intuition, connects to what we've done before, and does not involve a big fancy black-box function (and the accompanying annoying syntax). **At the end of the notebook we construct the synthetic control group "properly."**
Let's get to work...
from sklearn import linear_model # ML package. We'll use lasso.
train = terror.copy()
train.drop(train[train.index > dt.datetime(1975,1,1)].index, inplace = True)
X = train.pivot(columns='regionname',values='gdpcap') # Use existing index
Y = X['Basque Country (Pais Vasco)']
X.drop('Basque Country (Pais Vasco)', axis=1, inplace=True)
res_lasso = linear_model.LassoCV(alphas=None,cv=5,max_iter=10000).fit(X,Y) # Fit the model on pre-treatment data.
# NB, sample is pretty small so using a smallish k-fold cross validation.
print(f'The best alpha for this model is {res_lasso.alpha_:4.2f}.')
print(f'The algorithm tried {len(res_lasso.alphas_):d} different alpha values.')
The best alpha for this model is 0.01. The algorithm tried 100 different alpha values.
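If you're curious which regions the LASSO actually leans on, you can inspect the fitted coefficients (a quick sketch; lasso_weights is just an illustrative name). Recall from the caveat above that these "weights" need not be non-negative or sum to one.
# Inspect the LASSO coefficients: which control regions get non-zero weight?
lasso_weights = pd.Series(res_lasso.coef_, index=X.columns)
print(lasso_weights[lasso_weights != 0].sort_values(ascending=False))
print(f'Intercept: {res_lasso.intercept_:.3f}')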
# Whole sample
X = terror.pivot(columns='regionname',values='gdpcap') # Use existing index
X.drop('Basque Country (Pais Vasco)', axis=1, inplace=True)
control = pd.DataFrame(res_lasso.predict(X))
# Add column to original dataframe making sure that the indexes match
subset = terror.pivot(columns='regionname',values='gdpcap') # Use existing index
control = control.set_index(subset.index) # Transfer main index to this new df
subset['Synthetic Control'] = control # Add column to main dataset
subset.columns
Index(['Andalucia', 'Aragon', 'Baleares (Islas)', 'Basque Country (Pais Vasco)', 'Canarias', 'Cantabria', 'Castilla Y Leon', 'Castilla-La Mancha', 'Cataluna', 'Comunidad Valenciana', 'Extremadura', 'Galicia', 'Madrid (Comunidad De)', 'Murcia (Region de)', 'Navarra (Comunidad Foral De)', 'Principado De Asturias', 'Rioja (La)', 'Synthetic Control'], dtype='object', name='regionname')
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(subset.index, subset['Basque Country (Pais Vasco)'],color='blue',label='Basque Country (Treated)')
ax.plot(subset.index, subset['Synthetic Control'],color='red',label='Synthetic Control')
ax.legend(frameon=False,fontsize=18)
ax.axvline(x=dt.datetime(1975,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1970,1,1),11),va='center',ha='center',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1980,1,1),11),va='center',ha='center',size=18)
ax.set_ylim(0,12)
ax.set_yticks(np.arange(0, 12.01, step=1.0))
ax.set_ylabel('Per Capita GDP',size=14)
ax.set_xlabel(' ')
ax.set_title('Per Capita GDP Over Time (Basque Region)',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
The graph demonstrates the following:
- The synthetic control fits pre-period by construction.
- The actual effect is then the divergence post-period. In the graph this is the difference between the blue and red lines.
The difference in trends in the post-period will be our "average treatment effect" which we will quantify via the DiD approach using the "synthetic control" (i.e., post-period prediction using pre-period LASSO coefficients as weights) as the control group.
# DiD Regression to Identify Causality using the Synthetic Control
# Rename the columns to simplify syntax in the ols regression
subset.rename(columns={'Basque Country (Pais Vasco)':'Basque','Synthetic Control':'Synthetic'},inplace=True)
# Create treatment variable
subset['Post'] = 0
subset.loc[(subset.index>dt.datetime(1975, 1, 1)),'Post'] = 1
FD = smf.ols('Basque ~ Post + Synthetic + Synthetic*Post', data=subset).fit()
print(FD.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Basque   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.990
Method:                 Least Squares   F-statistic:                     1449.
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           5.14e-40
Time:                        15:10:41   Log-Likelihood:                 18.313
No. Observations:                  43   AIC:                            -28.63
Df Residuals:                      39   BIC:                            -21.58
Df Model:                           3
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -0.0386      0.185     -0.209      0.836      -0.413       0.336
Post               0.5926      0.294      2.014      0.051      -0.003       1.188
Synthetic          1.0072      0.034     29.844      0.000       0.939       1.075
Synthetic:Post    -0.1740      0.042     -4.099      0.000      -0.260      -0.088
==============================================================================
Omnibus:                       21.484   Durbin-Watson:                   0.439
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               33.654
Skew:                           1.475   Prob(JB):                     4.92e-08
Kurtosis:                       6.175   Cond. No.                         122.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Discussion: We find that the synthetic control method moves the estimated effect of the ETA bombings on the Basque economy from -0.14 (DiD with Cataluna) to -0.17. These point estimates are not statistically different from each other (though both are statistically different from zero), which indicates that in this case the added toil of synthetic controls didn't change the answer much quantitatively. Why?
Look back at the graph of GDP per capita for all regions and note that the regions look very similar to Cataluna. Hence, generating a convex combination of these others is pretty much the same as using Cataluna. Of course, this result is particular to our example and not generally true but it demonstrates that the Synthetic Control method nests the Difference-in-Differences approach.
The Effect of ETA Bombings The above approach gives the average effect, but we can also estimate the period-by-period treatment effect as the gap between the treated and synthetic control outcomes:
$$ \tau_{t} = y_{1t} - \hat y_{t}^S, $$where $y_{1t}$ and $\hat y_{t}^S$ are the actual (Basque) and synthetic control outcomes, respectively, and $\tau_{t}$ is the effect we're after. Let's plot it:
tau = subset['Basque'] - subset['Synthetic']
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(subset.index, tau,color='blue')
ax.axvline(x=dt.datetime(1975,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.axhline(y=0, xmin=0, xmax=1, color='gray', linestyle='--')
ax.annotate('Pre-Period',xy=(dt.datetime(1970,1,1),.1),va='center',ha='center',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1980,1,1),.1),va='center',ha='center',size=18)
ax.set_ylim(-1.5,0.2)
ax.set_yticks(np.arange(-1.5, 0.2, step=0.1))
ax.set_ylabel('Change in Per Capita GDP',size=14)
ax.set_xlabel(' ')
ax.set_title('Estimated Effects of ETA Bombings on Per Capita GDP\n (Synthetic Control Method)',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Practice
Let's try an example on your own to practice this technique.
You will estimate the effect of cigarette taxation on consumption:
- Taxes will increase the cost of cigarettes and lower their demand.
- Since cigarettes are addictive, consumers may be unresponsive, so a change in price won't affect demand much and the only effect of the policy is to tax users (which may be good if such funds are used to offset Medicare expenditure related to smoking).
Whether such a policy actually leads consumers to consume less tobacco will be your task to find out.
The Data
In 1988, California passed the Tobacco Tax and Health Protection Act also known as Proposition 99. The primary effect was to
impose a 25-cent per pack state excise tax on the sale of tobacco cigarettes within California, with approximately equivalent excise taxes similarly imposed on the retail sale of other commercial tobacco products, such as cigars and chewing tobacco. Additional restrictions placed on the sale of tobacco include a ban on cigarette vending machines in public areas accessible by juveniles and a ban on the individual sale of single cigarettes. Revenue generated by the act was earmarked for various environmental and health care programs, and anti-tobacco advertisements.
To evaluate whether this tax reduced smoking, we have data from 1970 to 2000 for 39 states.
- Load smoking.csv. Drop the variables lnincome, beer, and age15to24. Convert year to datetime and make it the index. The variable cigsale is average annual per capita packs sold.
smoking = pd.read_csv('./Data/smoking.csv')
smoking.drop(columns=["lnincome","beer", "age15to24"],inplace=True)
smoking['year'] = pd.to_datetime(smoking['year'],format='%Y')
smoking.set_index('year',inplace=True)
smoking.head()
state | cigsale | retprice | |
---|---|---|---|
year | |||
1970-01-01 | Rhode Island | 123.9 | 39.3 |
1970-01-01 | Tennessee | 99.8 | 39.9 |
1970-01-01 | Indiana | 134.6 | 30.6 |
1970-01-01 | Nevada | 189.5 | 38.9 |
1970-01-01 | Louisiana | 115.9 | 34.3 |
- Plot cigarette sales (cigsale) across time for (a) California and (b) the average for the other states. Be sure to indicate 1988, when Proposition 99 goes into effect. What trends do you see?
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(smoking[smoking['state']=='California'].index,
smoking[smoking['state']=='California']['cigsale'],color='blue',label='California')
ax.plot(smoking[smoking['state']=='California'].index,
smoking.loc[(smoking['state']!='California'),'cigsale'].groupby('year').mean(),
color='red',label='Not California (Average)')
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1987,1,1),140),va='center',ha='right',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1989,1,1),140),va='center',ha='left',size=18)
ax.legend(frameon=False,fontsize=18)
ax.set_ylim(0,150)
ax.set_yticks(np.arange(0, 150.01, step=10.0))
ax.set_ylabel('Per Capita Cigarette Sales (in packs)',size=14)
ax.set_xlabel(' ')
ax.set_title('Cigarette Sales Across Time (Select States)\n',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Use LASSO to create a Synthetic Control group using the 38 other states as the control group. Use 5-fold cross-validation to choose the penalty parameter $\alpha$ (ie, cv=5).
# Will be useful to save the LASSO algorithm as a function
def syn_control(state):
train = smoking.copy()
train.drop(train[train.index > dt.datetime(1988,1,1)].index, inplace = True)
X = train.pivot(columns='state',values='cigsale') # Use existing index
Y = X[state]
X.drop(state, axis=1, inplace=True)
# Fit Lasso
res_lasso = linear_model.LassoCV(alphas=None,cv=5,max_iter=10000).fit(X,Y) # Fit the model on pre-treatment data.
# Add synthetic data to sample
X = smoking.pivot(columns='state',values='cigsale') # Use existing index
X.drop(state, axis=1, inplace=True)
control = pd.DataFrame(res_lasso.predict(X))
# Add column to original dataframe making sure that the indexes match
data = smoking.pivot(columns='state',values='cigsale') # Use existing index
control = control.set_index(data.index) # Transfer main index to this new df
data['Synthetic Control'] = control # Add column to main dataset
return data
# Fit LASSO for California
subset = syn_control('California')
- Plot cigarette sales for California and the Synthetic Control you created. Be sure to indicate 1988, when Proposition 99 goes into effect. What trends do you see?
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(subset.index, subset['California'],color='blue',label='California (Treated)')
ax.plot(subset.index, subset['Synthetic Control'],color='red',label='Synthetic Control')
ax.legend(frameon=False)
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1987,1,1),140),va='center',ha='right',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1989,1,1),140),va='center',ha='left',size=18)
ax.legend(frameon=False,fontsize=18)
ax.set_ylim(0,150)
ax.set_yticks(np.arange(0, 150.01, step=10.0))
ax.set_ylabel('Per Capita Cigarette Sales (in packs)',size=14)
ax.set_xlabel(' ')
ax.set_title('Cigarette Sales Across Time (Select States)\n',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Use the Synthetic Control approach to estimate the effect of Proposition 99 for all states. This requires creating a synthetic control group for every state, so create a loop across the states in the data. Plot $\tau_t$ for each state and discuss the results.
- For California, this is the treatment effect.
- For the other states, this is a placebo effect since passage of Proposition 99 in California should not have affected cigarette sales in -- say -- Nevada. If it does, we have a problem and our control group needs to be refined.
subset = syn_control('California')
tau_lasso = subset['California'] - subset['Synthetic Control']
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(subset.index, tau_lasso,color='blue')
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.axhline(y=0, xmin=0, xmax=1, color='gray', linestyle='--')
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1987,1,1),5),va='center',ha='right',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1989,1,1),5),va='center',ha='left',size=18)
ax.set_ylim(-30,5)
ax.set_yticks(np.arange(-30, 5.01, step=5.0))
ax.set_ylabel('Change in Per Capita Cigarette Sales (in packs)',size=14)
ax.set_xlabel(' ')
ax.set_title('Synthetic Control Estimated Effect of Proposition 99\n',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
for r in smoking.state.unique():
subset = syn_control(r)
tau = subset[r] - subset['Synthetic Control']
if r=="California":
ax.plot(subset.index, tau,
linewidth=5.0, dashes=[5,5],label=r,color='blue')
else:
ax.plot(subset.index, tau)
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.axhline(y=0, xmin=0, xmax=1, color='gray', linestyle='--')
ax.legend(frameon=False,fontsize=18)
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1987,1,1),45),va='center',ha='right',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1989,1,1),45),va='center',ha='left',size=18)
ax.set_ylim(-50,50)
ax.set_yticks(np.arange(-50, 50.01, step=5.0))
ax.set_ylabel('Change in Per Capita Cigarette Sales (in packs)',size=14)
ax.set_xlabel(' ')
ax.set_title('Synthetic Control Estimates',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Looks promising but messy. There are a lot of states whose synthetic control fits poorly in the pre-period (large pre-treatment gaps). Let's drop them.
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
for r in smoking.state.unique():
subset = syn_control(r)
tau = subset[r] - subset['Synthetic Control']
if r=="California":
ax.plot(subset.index, tau,
linewidth=5.0, dashes=[5,5],label=r,color='blue')
else:
stat = np.mean(tau[tau.index<dt.datetime(1988,1,1)]**2)
if stat < 5:
ax.plot(subset.index, tau)
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.axhline(y=0, xmin=0, xmax=1, color='gray', linestyle='--')
ax.legend(frameon=False,fontsize=18)
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1987,1,1),45),va='center',ha='right',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1989,1,1),45),va='center',ha='left',size=18)
ax.set_ylim(-50,50)
ax.set_yticks(np.arange(-50, 50.01, step=5.0))
ax.set_ylabel('Change in Per Capita Cigarette Sales (in packs)',size=14)
ax.set_xlabel(' ')
ax.set_title('Synthetic Control Estimates\n(Large Pre-treatment Effects Removed)',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Report the estimated effect of Proposition 99 on cigarette smoking in 2000.
subset = syn_control('California')
tau_LASSO = subset['California'] - subset['Synthetic Control']
print(f'Using LASSO to construct a synthetic control, \nthe estimated effect of Proposition 99 was to reduce annual\nper capita packs by {abs(tau_LASSO[dt.datetime(2000,1,1)]):.2f} in 2000.')
Using LASSO to construct a synthetic control,
the estimated effect of Proposition 99 was to reduce annual
per capita packs by 22.38 in 2000.
- Build a synthetic control group using the elastic net model. Report the estimated effect of Proposition 99 on cigarette smoking in 2000. Again, use 5-fold cross-validation to solve for the penalty parameters.
def syn_control_elastic(state):
train = smoking.copy()
train.drop(train[train.index > dt.datetime(1988,1,1)].index, inplace = True)
X = train.pivot(columns='state',values='cigsale') # Use existing index
Y = X[state]
X.drop(state, axis=1, inplace=True)
l1_grid = np.arange(.1,1,1e-2).tolist()
# Fit the elastic net model. I set the number of alpha candidates for each l1_ratio value to 1000 (the default is 100)
res_elastic = linear_model.ElasticNetCV(l1_ratio=l1_grid,n_alphas=1000,cv=5).fit(X, Y)
# Add synthetic data to sample
X = smoking.pivot(columns='state',values='cigsale') # Use existing index
X.drop(state, axis=1, inplace=True)
control = pd.DataFrame(res_elastic.predict(X))
# Add column to original dataframe making sure that the indexes match
data = smoking.pivot(columns='state',values='cigsale') # Use existing index
control = control.set_index(data.index) # Transfer main index to this new df
data['Synthetic Control'] = control # Add column to main dataset
return data
subset = syn_control_elastic('California')
tau_elastic = subset['California'] - subset['Synthetic Control']
# Create Figure
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(subset.index, tau_elastic,color='blue')
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.axhline(y=0, xmin=0, xmax=1, color='gray', linestyle='--')
ax.axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax.annotate('Pre-Period',xy=(dt.datetime(1987,1,1),5),va='center',ha='right',size=18)
ax.annotate('Post-Period',xy=(dt.datetime(1989,1,1),5),va='center',ha='left',size=18)
ax.set_ylim(-30,5)
ax.set_yticks(np.arange(-30, 5.01, step=5.0))
ax.set_ylabel('Change in Per Capita Cigarette Sales (in packs)',size=14)
ax.set_xlabel(' ')
ax.set_title('Synthetic Control Estimated Effect of Proposition 99 (Elastic Net)\n',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
print(f'Using Elastic Net to construct a synthetic control, \nthe estimated effect of Proposition 99 was to reduce annual\nper capita packs by {abs(tau_elastic[dt.datetime(2000,1,1)]):.2f} in 2000.')
print(f'\nThe effect was {abs(tau_LASSO[dt.datetime(2000,1,1)]):.2f} using LASSO.')
Using Elastic Net to construct a synthetic control,
the estimated effect of Proposition 99 was to reduce annual
per capita packs by 24.10 in 2000.

The effect was 22.38 using LASSO.
Constraining the Weights to Construct the Synthetic Control
We actually should have constrained the weights to be positive and sum to one. The idea is that we want to force the algorithm (ie, the fitting exercise) to only interpolate. Restricting the weights to be positive and to sum to one makes the synthetic control a convex combination of the candidate control units.
Practically, doing this procedure requires us to construct our own minimization routine -- which I do below.
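Concretely, letting $y_t$ denote the treated unit's pre-treatment outcome and $x_{jt}$ the outcome of candidate control $j$, the routine below solves
$$ \min_{w_1,\dots,w_J} \; \sqrt{\frac{1}{T_0}\sum_{t=1}^{T_0}\Big(y_t - \sum_{j=1}^{J} w_j x_{jt}\Big)^{2}} \quad \text{subject to } w_j \ge 0 \text{ and } \sum_{j=1}^{J} w_j = 1, $$
where $T_0$ is the number of pre-treatment periods. The objective is the root mean squared pre-period fit, and the constraints deliver a convex combination of the controls.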
from functools import partial # pre-fill the data arguments of the loss function
from scipy.optimize import fmin_slsqp # constrained minimization (sequential least squares programming)
# Define the loss function (root mean squared error between the treated series and the weighted controls)
def loss_w(W, X, y) -> float:
return np.sqrt(np.mean((y - X.dot(W))**2))
# Define function to solve for weights with the condition that weights be positive and sum to one
def get_w(X, y):
w_start = [1/X.shape[1]]*X.shape[1]
weights = fmin_slsqp(partial(loss_w, X=X, y=y),
np.array(w_start),
f_eqcons=lambda x: np.sum(x) - 1, # constrain weights sum to one
bounds=[(0.0, 1.0)]*len(w_start), # constrain weights to be between zero and one
disp=False) # don't show progress
return weights
def syn_control_weights(state):
train = smoking.copy()
train.drop(train[train.index > dt.datetime(1988,1,1)].index, inplace = True)
X = train.pivot(columns='state',values='cigsale') # Use existing index
Y = X[state]
X.drop(state, axis=1, inplace=True)
# Fit the constrained model
calif_weights = get_w(X, Y)
# Add synthetic data to sample
X = smoking.pivot(columns='state',values='cigsale') # Use existing index
X.drop(state, axis=1, inplace=True)
control = pd.DataFrame(X.values.dot(calif_weights))
# Add column to original dataframe making sure that the indexes match
data = smoking.pivot(columns='state',values='cigsale') # Use existing index
control = control.set_index(data.index) # Transfer main index to this new df
data['Synthetic Control'] = control # Add column to main dataset
return data
subset = syn_control_weights('California')
tau = subset['California'] - subset['Synthetic Control']
# Create Figure
fig, ax = plt.subplots(2,1,figsize=(20,20))
#--------------------------------------------
# Time-series: Synthetic Control vs California
#--------------------------------------------
ax[0].plot(subset.index, subset['California'],color='blue',label='California (Treated)')
ax[0].plot(subset.index, subset['Synthetic Control'],color='red',label='Synthetic Control')
ax[0].legend(frameon=False)
ax[0].axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax[0].annotate('Pre-Period',xy=(dt.datetime(1987,1,1),140),va='center',ha='right',size=18)
ax[0].annotate('Post-Period',xy=(dt.datetime(1989,1,1),140),va='center',ha='left',size=18)
ax[0].legend(frameon=False,fontsize=18)
ax[0].set_ylim(0,150)
ax[0].set_yticks(np.arange(0, 150.01, step=10.0))
ax[0].set_ylabel('Per Capita Cigarette Sales (in packs)',size=14)
ax[0].set_xlabel(' ')
ax[0].set_title('Cigarette Sales Across Time\n(Synthetic Control with Constrained Weights)\n',size=20)
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
#--------------------------------------------
# Time-series of Tau
#--------------------------------------------
ax[1].plot(subset.index, tau,color='black',label='Constrained')
ax[1].plot(subset.index, tau_lasso,color='blue',label='LASSO')
ax[1].plot(subset.index, tau_elastic,color='red',label='Elastic Net')
ax[1].legend(frameon=False,fontsize=18)
ax[1].axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax[1].axhline(y=0, xmin=0, xmax=1, color='gray', linestyle='--')
ax[1].axvline(x=dt.datetime(1988,1,1), ymin=0, ymax=1, color='black', linewidth=2.0, dashes=[5,5])
ax[1].annotate('Pre-Period',xy=(dt.datetime(1987,1,1),5),va='center',ha='right',size=18)
ax[1].annotate('Post-Period',xy=(dt.datetime(1989,1,1),5),va='center',ha='left',size=18)
ax[1].set_ylim(-30,5)
ax[1].set_yticks(np.arange(-30, 5.01, step=5.0))
ax[1].set_ylabel('Change in Per Capita Cigarette Sales (in packs)',size=14)
ax[1].set_xlabel(' ')
ax[1].set_title('Estimated Effect of Proposition 99 on Cigarette Purchases \n',size=20)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
plt.subplots_adjust(hspace=.3) # adjust white space between figures
plt.show()
print(f'The effect was {abs(tau_lasso[dt.datetime(2000,1,1)]):.2f} using LASSO.')
print(f'The effect was {abs(tau_elastic[dt.datetime(2000,1,1)]):.2f} using elastic net.')
print(f'The effect was {abs(tau[dt.datetime(2000,1,1)]):.2f} using constrained weights.')
The effect was 22.38 using LASSO. The effect was 24.10 using elastic net. The effect was 26.60 using constrained weights.