Lecture 2: Forecasting [FINISHED]
Machine learning is generally about prediction (i.e., we want good values for $\hat{y}$), though later in the semester we will see how ML tools can be used to address causality as well ($\hat{\beta}$). This notebook is a brief introduction to time-series models, identification, and methods. Forecasting often comes down to being able to estimate time-series models well.
The agenda for today's lecture is as follows:
1. Static models
2. Lagged independent variables
3. Time trends and seasonality
4. Stationarity and autoregressive (AR) models
5. Autoregressive integrated moving-average (ARIMA) models
import pandas as pd # for data handling
import numpy as np # for numerical methods and data structures
import matplotlib.pyplot as plt # for plotting
from matplotlib.ticker import FormatStrFormatter
import seaborn as sea # advanced plotting
1. Static Models
Suppose that we have time series data available on two variables, say y and z, where $y_t$ and $z_t$ are dated contemporaneously (eg, measured in the same year). A static model relating y to z is
$$ y_t = \beta_0 + \beta_1 z_t + \epsilon_{t} $$
where $t=1,...,T$. We call this kind of model "static" since a change in z at time t is assumed to have an immediate effect on y. Static regression models are also used when we are interested in the tradeoff between y and z. While the models in the previous lecture were based on cross-sectional variation (ie, we compare two observationally equivalent people, one of whom has more education, and ask whether one sleeps more than the other), the variation here comes from time.
Just as in the cross-sectional case, the usual inference procedures are only as good as the underlying assumptions. The classical linear model assumptions for time series data are much more restrictive than those for the cross-sectional data—in particular, the strict exogeneity and no serial correlation assumptions can be unrealistic. Nevertheless, this is a good framework to start.
Example: 3-month Treasuries
The data in INTDEF.dta come from the 1997 Economic Report of the President and span the years 1948 through 1996. The variable i3 is the three-month T-bill rate, inf is the annual inflation rate based on the consumer price index (CPI), and def is the federal budget deficit as a percentage of GDP. We're interested in understanding the connection between inflation and treasuries.
Let's load the data and clean things up a bit:
df = pd.read_stata('./Data/INTDEF.dta')
df.rename(columns={'i3':'treasuries','inf':'inflation','def':'deficit','def_1':'deficit_lag'}, inplace=True)
df.head()
| | year | treasuries | inflation | rec | out | deficit | i3_1 | inf_1 | deficit_lag | ci3 | cinf | cdef | y77 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1948 | 1.04 | 8.1 | 16.4 | 11.7 | -4.700000 | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 1 | 1949 | 1.10 | -1.2 | 14.6 | 14.4 | -0.200001 | 1.04 | 8.1 | -4.700000 | 0.06 | -9.3 | 4.499999 | 0 |
| 2 | 1950 | 1.22 | 1.3 | 14.5 | 15.6 | 1.100000 | 1.10 | -1.2 | -0.200001 | 0.12 | 2.5 | 1.300001 | 0 |
| 3 | 1951 | 1.55 | 7.9 | 16.1 | 14.2 | -1.900001 | 1.22 | 1.3 | 1.100000 | 0.33 | 6.6 | -3.000001 | 0 |
| 4 | 1952 | 1.77 | 1.9 | 18.9 | 19.4 | 0.500000 | 1.55 | 7.9 | -1.900001 | 0.22 | -6.0 | 2.400001 | 0 |
We're interested in the connection between inflation, treasuries, and the deficit so we estimate the following model:
$$ \text{inflation}_t = \beta_0 + \beta_1\text{treasuries}_t + \beta_2\text{deficit}_t + \epsilon_t $$
We can estimate this equation using a standard OLS estimator:
import statsmodels.formula.api as smf
print(smf.ols('inflation ~ treasuries + deficit', data=df).fit().summary())
OLS Regression Results ============================================================================== Dep. Variable: inflation R-squared: 0.588 Model: OLS Adj. R-squared: 0.570 Method: Least Squares F-statistic: 32.87 Date: Fri, 14 Oct 2022 Prob (F-statistic): 1.36e-09 Time: 07:06:10 Log-Likelihood: -104.01 No. Observations: 49 AIC: 214.0 Df Residuals: 46 BIC: 219.7 Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 0.4382 0.595 0.737 0.465 -0.759 1.635 treasuries 0.9579 0.118 8.092 0.000 0.720 1.196 deficit -0.6399 0.172 -3.723 0.001 -0.986 -0.294 ============================================================================== Omnibus: 6.677 Durbin-Watson: 0.962 Prob(Omnibus): 0.035 Jarque-Bera (JB): 6.664 Skew: 0.899 Prob(JB): 0.0357 Kurtosis: 2.833 Cond. No. 12.8 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Basic economics suggests that inflation, the deficit, and short-term interest rates should all move together. In these estimates, a ceteris paribus one percentage point increase in the treasury rate is associated with a 0.96 point increase in inflation, while a ceteris paribus one percentage point increase in the deficit (as a share of GDP) is associated with a 0.64 point decrease in inflation. Both coefficients are highly statistically significant, but the negative sign on the deficit is counter-intuitive.
2. Lagged Independent Variables
It's not hard to write down a model (or use basic intuition) to see that inflation is a function of past variables, since people make purchase decisions based on their expectations of the future economy. In a finite distributed lag (FDL) model, we allow one or more variables to affect y with a time lag.
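For example, an FDL model of order two relating inflation to current and past treasury rates would be
$$ \text{inflation}_t = \alpha_0 + \delta_0\,\text{treasuries}_t + \delta_1\,\text{treasuries}_{t-1} + \delta_2\,\text{treasuries}_{t-2} + \epsilon_t, $$
where $\delta_0$ is the immediate (impact) effect and $\delta_0+\delta_1+\delta_2$ is the long-run effect of a permanent one-unit change in the treasury rate. Below we estimate a version with one lag of each regressor: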
df['treasuries_lag'] = df['treasuries'].shift(1) # create lagged treasuries
print(smf.ols('inflation ~ treasuries + treasuries_lag + deficit + deficit_lag', data=df).fit().summary())
OLS Regression Results ============================================================================== Dep. Variable: inflation R-squared: 0.614 Model: OLS Adj. R-squared: 0.578 Method: Least Squares F-statistic: 17.07 Date: Fri, 14 Oct 2022 Prob (F-statistic): 1.87e-08 Time: 07:06:10 Log-Likelihood: -100.04 No. Observations: 48 AIC: 210.1 Df Residuals: 43 BIC: 219.4 Df Model: 4 Covariance Type: nonrobust ================================================================================== coef std err t P>|t| [0.025 0.975] ---------------------------------------------------------------------------------- Intercept 0.0900 0.612 0.147 0.884 -1.144 1.324 treasuries 0.8456 0.243 3.476 0.001 0.355 1.336 treasuries_lag 0.1175 0.268 0.438 0.664 -0.424 0.659 deficit -0.5750 0.285 -2.021 0.050 -1.149 -0.001 deficit_lag 0.0677 0.216 0.314 0.755 -0.367 0.502 ============================================================================== Omnibus: 8.592 Durbin-Watson: 0.759 Prob(Omnibus): 0.014 Jarque-Bera (JB): 8.300 Skew: 1.010 Prob(JB): 0.0158 Kurtosis: 3.266 Cond. No. 18.6 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Hmmm... lags don't seem to matter. I think we need a better theory to motivate this regression!
Example: Fertility and Economic Conditions
People write papers documenting that economic cycles are connected to fertility: when times are good, people have kids. I wonder how policy might be explicitly connected to fertility. Let's find out. Consider the model
$$ \text{gfr}_t=\beta_0 + \beta_1\text{pe}_t + \beta_2\text{pe}_{t-1} + \beta_3\text{pe}_{t-2} + \epsilon_t $$
where $\text{gfr}_t$ is the general fertility rate (children born per 1,000 women of childbearing age) and $\text{pe}_t$ is the real dollar value of the personal tax exemption. The idea is to see whether, in the aggregate, the decision to have children is linked to the tax value of having a child. The above equation recognizes that, for both biological and behavioral reasons, decisions to have children would not immediately result from changes in the personal tax exemption (or a change in disposable income more generally).
from statsmodels.iolib.summary2 import summary_col # Import a tool to make regression tables
df = pd.read_stata("./Data/FERTIL3.dta")
regf = smf.ols('gfr ~ pe + ww2 + pill', data=df).fit()
tsregf = smf.ols('gfr ~ pe + pe_1 + pe_2 + ww2 + pill', data=df).fit()
print(summary_col([regf, tsregf],stars=True,float_format='%0.3f',
model_names=['Static\n(b/se)','Lags\n(b/se)'],
info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
'R2':lambda x: "{:.3f}".format(x.rsquared),
'Adj.R2':lambda x: "{:.3f}".format(x.rsquared_adj)}))
==================================== Static Lags (b/se) (b/se) ------------------------------------ Intercept 98.682*** 95.870*** (3.208) (3.282) R-squared 0.473 0.499 R-squared Adj. 0.450 0.459 pe 0.083*** 0.073 (0.030) (0.126) pe_1 -0.006 (0.156) pe_2 0.034 (0.126) pill -31.594*** -31.305*** (4.081) (3.982) ww2 -24.238*** -22.126** (7.458) (10.732) N 72 70 R2 0.473 0.499 Adj.R2 0.450 0.459 ==================================== Standard errors in parentheses. * p<.1, ** p<.05, ***p<.01
Static Model
Let's take a look at the Static model. Each variable is statistically significant at the 1% level against a two-sided alternative. We see that the fertility rate was lower during World War II: i.e., given pe, there were about 24 fewer births for every 1,000 women of childbearing age, which is a large reduction given that, from 1913 through 1984, gfr ranged from about 65 to 127. Similarly, the fertility rate has been substantially lower since the introduction of the birth control pill (i.e., a negative and statistically significant coefficient).
The variable of economic interest is pe. The average pe over this time period is $100.40 and ranges from zero to $243.83. The coefficient on pe implies that a 12-dollar increase in pe increases gfr by about one birth per 1,000 women of childbearing age -- that's pretty big!
Lags Model
Now look at the Lags model. In this regression, we only have 70 observations because we lose two when we lag pe twice. The coefficients on the pe variables are estimated very imprecisely -- each one is individually insignificant. It turns out that there is substantial correlation between the lags (why?), and this multicollinearity makes it difficult to estimate the effect at each lag.
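To see the multicollinearity directly, we can compute the correlation matrix of pe and its lags (a quick check using the pe_1 and pe_2 columns already in the data):
print(df[['pe', 'pe_1', 'pe_2']].corr()) # pairwise correlations between pe and its first two lags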
hypotheses = '(pe = pe_1 = pe_2 = 0)'
f_test = tsregf.f_test(hypotheses)
print(f_test)
<F test: F=3.972964046978317, p=0.011652005303129494, df_denom=64, df_num=3>
The F statistic for joint significance of the lags has a p-value of 1.2%, so there is something going on, but our crude lag specification is not quite capturing it. Maybe we can model this better, perhaps by including a linear or exponential time trend (see the sketch below).
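As a quick illustration (not a full analysis), one way to add a linear trend is to construct a trend variable by hand and include it in the regression; the column name trend below is something I'm creating, not part of FERTIL3.dta:
df['trend'] = np.arange(len(df)) # simple linear time trend: 0, 1, 2, ...
trend_reg = smf.ols('gfr ~ pe + pe_1 + pe_2 + ww2 + pill + trend', data=df).fit()
print(trend_reg.params) # see how the pe coefficients change once we control for the trend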
3. Time Trends and Seasonality
Many economic time series have a common tendency of growing over time. We must therefore recognize that some series contain a time trend in order to draw causal inference using time series data. Ignoring the fact that two sequences are trending in the same or opposite directions can lead us to falsely conclude that changes in one variable are actually caused by changes in another variable. In many cases, two time series processes appear to be correlated only because they are both trending over time for reasons related to other unobserved factors.
Let's load some data to illustrate this point:
df = pd.read_stata('./Data/EARNS.dta')
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(df.year, df['outphr'], color = 'red')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('')
ax.set_title('Labor Productivity (Output per Hour Worked)',size=28)
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
The above presents labor productivity (output per hour of work) in the US for the years 1947 through 1987. This series displays a clear upward trend reflecting the fact that workers have become more productive over time. What kind of statistical models adequately capture trending behavior?
$$ y_t = \beta_0 + \beta_1 t + \epsilon_t $$
The interpretation is straightforward: if $\beta_1>0$, then, on average, $y_t$ is growing over time and therefore has an upward (linear) trend. Of course, the opposite is true when $\beta_1<0$. A closely related way to capture persistence is to model $y_t$ as a function of its own past, $y_t = \beta_0 + \beta_1 y_{t-1} + \epsilon_t$; when $\beta_1=1$ in that specification, we say the process exhibits a unit root -- the best prediction for the future is today's value.
Seasonality
If a time series is observed at monthly or quarterly intervals (or even weekly or daily), it may exhibit seasonality. Some examples:
- Monthly housing starts in the Midwest are strongly influenced by weather. While weather patterns are somewhat random, we can be sure that the weather during January will usually be more inclement than in June, and so housing starts are generally higher in June than in January.
- Retail sales in the fourth quarter are typically higher than in the previous three quarters because of the Christmas holiday. Again, this can be captured by allowing the average retail sales to differ over the course of a year. This is in addition to possibly allowing for a trending mean. That is, retail sales in the most recent first quarter were higher than retail sales in the fourth quarter from 30 years ago, because retail sales have been steadily growing. Nevertheless, if we compare average sales within a typical year, the seasonal holiday factor tends to make sales larger in the fourth quarter.
We can model these issues by including season fixed effects to allow the expected value of the series, $y_t$, to vary by month, quarter, etc.
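As a minimal sketch (the data frame some_df and the columns y, x, and month here are placeholders, not variables defined in this notebook), season fixed effects can be added with patsy's C() syntax inside a statsmodels formula:
season_reg = smf.ols('y ~ x + C(month)', data=some_df).fit() # C(month) adds a dummy for each month (one category omitted)
The antidumping example below does the equivalent thing with explicit month dummies (feb, mar, ...).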
An Example: Antidumping
Krupp and Pollard (1996) analyzed the effects of antidumping filings by U.S. chemical industries on imports of various chemicals. We focus here on one industrial chemical, barium chloride, a cleaning agent used in various chemical processes and in gasoline production. In the early 1980s, U.S. barium chloride producers believed that China was selling barium chloride in the U.S. at an unfairly low price (an action known as dumping), and the industry filed a complaint with the U.S. International Trade Commission (ITC) in October 1983. The ITC ruled in favor of the U.S. barium chloride industry in October 1984.
There are several questions of interest in this case:
- Are imports unusually high in the period immediately preceding the initial filing?
- Do imports change noticeably after an antidumping filing?
- What is the reduction in imports after a decision in favor of the U.S. industry?
df = pd.read_stata('./Data/BARIUM.dta')
antid_month = smf.ols('lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + feb + mar + apr + may + jun + jul + aug + sep + oct + nov + dec + 1', data=df).fit()
print(antid_month.summary())
OLS Regression Results ============================================================================== Dep. Variable: lchnimp R-squared: 0.358 Model: OLS Adj. R-squared: 0.262 Method: Least Squares F-statistic: 3.712 Date: Fri, 14 Oct 2022 Prob (F-statistic): 1.28e-05 Time: 07:06:13 Log-Likelihood: -109.54 No. Observations: 131 AIC: 255.1 Df Residuals: 113 BIC: 306.8 Df Model: 17 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 16.7788 32.429 0.517 0.606 -47.468 81.026 lchempi 3.2651 0.493 6.624 0.000 2.288 4.242 lgas -1.2781 1.389 -0.920 0.359 -4.030 1.474 lrtwex 0.6630 0.471 1.407 0.162 -0.271 1.597 befile6 0.1397 0.267 0.524 0.602 -0.389 0.668 affile6 0.0126 0.279 0.045 0.964 -0.539 0.565 afdec6 -0.5213 0.302 -1.726 0.087 -1.120 0.077 feb -0.4177 0.304 -1.372 0.173 -1.021 0.185 mar 0.0591 0.265 0.223 0.824 -0.465 0.584 apr -0.4515 0.268 -1.682 0.095 -0.983 0.080 may 0.0333 0.269 0.124 0.902 -0.500 0.567 jun -0.2063 0.269 -0.766 0.445 -0.740 0.327 jul 0.0038 0.279 0.014 0.989 -0.548 0.556 aug -0.1571 0.278 -0.565 0.573 -0.708 0.394 sep -0.1342 0.268 -0.501 0.617 -0.664 0.396 oct 0.0517 0.267 0.194 0.847 -0.477 0.580 nov -0.2463 0.263 -0.937 0.351 -0.767 0.274 dec 0.1328 0.271 0.489 0.626 -0.405 0.671 ============================================================================== Omnibus: 9.169 Durbin-Watson: 1.325 Prob(Omnibus): 0.010 Jarque-Bera (JB): 9.324 Skew: -0.540 Prob(JB): 0.00945 Kurtosis: 3.736 Cond. No. 1.47e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.47e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Some comments:
- The dependent variable is the (logged) volume of imports of barium chloride from China, chnimp.
- The equation shows that befile6 (ie, a dummy variable equal to one during the six months before filing) is statistically insignificant, so there is no evidence that Chinese imports were unusually high during the six months before the suit was filed.
- The estimate on affile6 (ie, a dummy variable equal to one during the six months after filing) is tiny (implying roughly a 1.3% change in Chinese imports) and statistically very insignificant, so there is little evidence that imports responded to the filing itself.
- The coefficient on afdec6 shows a substantial fall in Chinese imports of barium chloride after the decision in favor of the U.S. industry. How much? See the conversion just below.
- Are there seasonality concerns? It could be that the months just before the suit was filed are months where imports are higher or lower, on average, than in other months.
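Because the dependent variable is in logs, a coefficient $\hat{\beta}$ on a dummy variable implies an exact percentage change of $100\,(e^{\hat{\beta}}-1)$. Applying this to the afdec6 estimate above:
$$ 100\,(e^{-0.521}-1) \approx -40.6, $$
i.e., roughly a 40 percent fall in Chinese imports in the six months after the ITC decision, holding the other regressors fixed.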
hypotheses = '(feb = mar = apr = may = jun = jul = aug = sep = oct = nov = dec= 0)'
f_test = antid_month.f_test(hypotheses)
# Print the results with an f-string
print(f'The F-statistic for all months jointly equal to zero is {f_test.fvalue:.2f} with a p-value of {f_test.pvalue:.2f}, so we fail to\nreject the null hypothesis that the month dummies are jointly zero.')
The F-statistic for all months jointly equal to zero is 0.86 with a p-value of 0.59, so we fail to reject the null hypothesis that the month dummies are jointly zero.
A test of the joint significance of the 11 month dummies generates a low F-statistic (0.86) with a p-value of 0.59, so the seasonal dummies are jointly insignificant.
Practice
- Re-estimate the above model with seasonal dummies (Spring, Summer, Fall, Winter). Hint: Look at the data.
df.dtypes
chnimp float32 bchlimp float32 befile6 int8 affile6 int8 afdec6 int8 befile12 int8 affile12 int8 afdec12 int8 chempi float32 gas float32 rtwex float32 spr int8 sum int8 fall int8 lchnimp float32 lgas float32 lrtwex float32 lchempi float32 t int16 feb int8 mar int8 apr int8 may int8 jun int8 jul int8 aug int8 sep int8 oct int8 nov int8 dec int8 percchn float32 dtype: object
antid_season = smf.ols('lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + spr + sum + fall', data=df).fit()
print(antid_season.summary())
OLS Regression Results ============================================================================== Dep. Variable: lchnimp R-squared: 0.310 Model: OLS Adj. R-squared: 0.258 Method: Least Squares F-statistic: 6.032 Date: Fri, 14 Oct 2022 Prob (F-statistic): 5.79e-07 Time: 07:06:13 Log-Likelihood: -114.33 No. Observations: 131 AIC: 248.7 Df Residuals: 121 BIC: 277.4 Df Model: 9 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -26.5219 23.297 -1.138 0.257 -72.645 19.602 lchempi 3.0779 0.486 6.331 0.000 2.116 4.040 lgas 0.5651 1.000 0.565 0.573 -1.415 2.545 lrtwex 1.1015 0.425 2.594 0.011 0.261 1.942 befile6 0.0767 0.265 0.289 0.773 -0.448 0.601 affile6 -0.0833 0.273 -0.305 0.761 -0.623 0.457 afdec6 -0.6212 0.295 -2.103 0.038 -1.206 -0.036 spr -0.0412 0.151 -0.273 0.786 -0.341 0.258 sum -0.1519 0.169 -0.897 0.371 -0.487 0.183 fall -0.0673 0.154 -0.436 0.664 -0.373 0.239 ============================================================================== Omnibus: 8.751 Durbin-Watson: 1.439 Prob(Omnibus): 0.013 Jarque-Bera (JB): 9.596 Skew: -0.466 Prob(JB): 0.00825 Kurtosis: 3.943 Cond. No. 1.06e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.06e+04. This might indicate that there are strong multicollinearity or other numerical problems.
- Test whether these seasonal variables are jointly significant.
hypotheses = '(spr = sum = fall= 0)'
f_test = antid_season.f_test(hypotheses)
print(f_test)
<F test: F=0.28224471031401044, p=0.8381333029196874, df_denom=121, df_num=3>
- Report the reduction in Chinese imports after the decision in favor of US firms.
print(f'Chinese imports fell by {100*(1-np.exp(-0.6212)):4.2f} percent in the six months after the US ITC October 1984 decision.')
Chinese imports fell by 46.27 percent in the six months after the US ITC October 1984 decision.
4. Stationarity and Autoregressive (AR) Models
Let's expand our analysis beyond linear trend models to look at autoregressive models. Here, we'll take advantage of statsmodels.tsa, which contains model classes and functions useful for time series analysis. Basic models include univariate autoregressive models (AR), vector autoregressive models (VAR), and univariate autoregressive moving average models (ARMA). We are only going to touch on autoregressive models here. The package also includes filtering methods and time-series-related plotting functions.
We'll focus on autoregressive (AR) models, which model a variable as a linear function of its own past values. Such models are particularly useful for economic data, such as the series available from FRED. You may have to install the pandas-datareader API via
pip install pandas_datareader
import statsmodels.graphics.tsaplots as tsaplots # Gives the the autocorrelation plot
import statsmodels.tsa as tsa # The time series models
from statsmodels.tsa.ar_model import AutoReg
from pandas_datareader import data, wb # we are grabbing the data and wb functions from the package
import datetime as dt # for time and date
codes = ['GDPC1'] # The code for real GDP at FRED. You can go to the website and look it up.
start = dt.datetime(1947, 1, 1)
usa = data.DataReader(codes, 'fred', start)
usa.rename(columns={'GDPC1':'gdp'}, inplace=True)
It's always a good idea to plot your data before doing any serious stuff. Look for variations in the data which you can use to build a better model.
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(usa.index, usa['gdp'], color = 'red')
ax.set_ylabel('bil 2012 dollars')
# Make it look nice
sea.despine(ax=ax)
ax.set_title('US GDP',size=28)
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
Real GDP, like the labor productivity series we saw earlier, displays a clear upward trend.
Stationarity
The GDP data are not stationary --- there is clearly a trend. Autoregressive models are usually applied to stationary data. We can filter the data to remove the trend and recover a stationary series (tsa has methods to do this), but GDP growth rates are typically treated as stationary.
To see this, let's compute and plot the growth rate of the GDP data.
# The log-difference is a close approximation of the growth rate when growth rates are small.
# log-differences are often used in practice because they are symmetric.
usa['gdp_diff'] = np.log(usa['gdp']) - np.log(usa['gdp'].shift(1))
usa['gdp_pct'] = usa['gdp'].pct_change()
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(usa.index, usa['gdp_diff'], color = 'red')
ax.plot(usa.index, usa['gdp_pct'], color = 'blue')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('US GDP growth rates',size=28)
ax.yaxis.set_major_formatter(FormatStrFormatter('%0.3f'))
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
Note that the percent-change and log-difference series lie on top of each other. This illustrates the idea that log differences are a close approximation to growth rates and percentage changes.
The figure looks stationary, i.e., no trend; even the large COVID-19 swings look like temporary deviations that return to the mean. There are formal tests for a unit root, such as the (augmented) Dickey-Fuller test (again, available in statsmodels), but they are outside the scope of this notebook, though a minimal example is sketched below.
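In case you're curious, here's a minimal sketch of an augmented Dickey-Fuller test on the growth-rate series; the null hypothesis is a unit root, so a small p-value is evidence the series is stationary:
from statsmodels.tsa.stattools import adfuller # augmented Dickey-Fuller unit root test
adf_stat, adf_pvalue, *_ = adfuller(usa['gdp_diff'].dropna()) # returns the test statistic, p-value, and more
print(f'ADF statistic = {adf_stat:.2f}, p-value = {adf_pvalue:.3f}')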
Autoregressive (AR) models
An autoregressive model is a model where the outcome variable is a function of its past values. The general model is
$$ y_t = \varphi_0 + \varphi_1 y_{t-1} + \varphi_2 y_{t-2} + \cdots + \varphi_p y_{t-p}+\epsilon_t. $$
This model has $p$ lags. We refer to it as an AR(p) model or an autoregressive model of order $p$.
Determining the number of lags to include is part of specifying the model.
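One data-driven way to choose the number of lags is an information criterion. Here's a minimal sketch using statsmodels' ar_select_order helper on the GDP growth series we just constructed (assuming a reasonably recent statsmodels version):
from statsmodels.tsa.ar_model import ar_select_order # information-criterion-based lag selection
sel = ar_select_order(usa['gdp_diff'].dropna(), maxlag=8, ic='bic') # search over AR orders up to 8
print(sel.ar_lags) # the lag order selected by the BIC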
Let's start by looking at the data to see if it has an autoregressive property. Is there a relationship between the current value of GDP growth and its past value?
fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(usa['gdp_diff'].shift(1), usa['gdp_diff'], color = 'red') # lagged growth (t-1) on the x-axis, current growth (t) on the y-axis
# plot the 45 degree line
ax.plot([-0.02, 0.045], [-0.02, 0.045], color='black')
ax.text(0.04, 0.037, '45-degree line')
ax.set_ylabel('gdp(t)')
ax.set_xlabel('gdp(t-1)')
sea.despine(ax=ax, trim=True)
plt.show()
There appears to be a positive relationship, but it is noisy.
The autocorrelation function
Our plot above tells us about the $t$ and $t-1$ relationship. How about $t$ and $t-2$? $t-5$? We could continue to make the plots above, but there are better ways.
The autocorrelation function plots $\text{corr}(y_t, y_{t-k})$ for many values of $k$. statsmodels automates this for us with the .plot_acf() function (see the docs).
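If you want the raw numbers rather than (or in addition to) the plot, the acf function returns the same autocorrelations as an array:
from statsmodels.tsa.stattools import acf # numeric autocorrelation function
print(acf(usa['gdp_diff'].dropna(), nlags=8)) # corr(gdp_t, gdp_{t-k}) for k = 0,...,8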
import statsmodels.graphics.tsaplots as tsaplots # Gives the the autocorrelation plot
import statsmodels.tsa as tsa # The time series models
from statsmodels.tsa.ar_model import AutoReg
fig, ax = plt.subplots(figsize=(10,6))
# plot_acf is picky about nans, so drop them
# The 'lags' parameter determines how many lags to show
tsaplots.plot_acf(usa['gdp_diff'].dropna(), lags = 8, ax = ax)
ax.set_xlabel('lag = k', fontsize=14)
ax.set_ylabel(r'corr(gdp$_t$, gdp$_{t-k})$', fontsize=14)
sea.despine(ax=ax)
plt.show()
Estimating the AR(p) model
The AR(p) model in statsmodels works like the other models. You create a model object, then you call the .fit() method to estimate the parameters. We'll first estimate the model with two lags using the standard statsmodels approach:
print(smf.ols('gdp_diff ~ usa["gdp_diff"].shift(1) + usa["gdp_diff"].shift(2)', data=usa).fit().summary())
OLS Regression Results ============================================================================== Dep. Variable: gdp_diff R-squared: 0.026 Model: OLS Adj. R-squared: 0.019 Method: Least Squares F-statistic: 3.946 Date: Fri, 14 Oct 2022 Prob (F-statistic): 0.0204 Time: 07:06:30 Log-Likelihood: 914.88 No. Observations: 299 AIC: -1824. Df Residuals: 296 BIC: -1813. Df Model: 2 Covariance Type: nonrobust ============================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------------------- Intercept 0.0060 0.001 6.787 0.000 0.004 0.008 usa["gdp_diff"].shift(1) 0.1010 0.058 1.749 0.081 -0.013 0.215 usa["gdp_diff"].shift(2) 0.1146 0.058 1.985 0.048 0.001 0.228 ============================================================================== Omnibus: 118.366 Durbin-Watson: 1.984 Prob(Omnibus): 0.000 Jarque-Bera (JB): 6450.531 Skew: -0.731 Prob(JB): 0.00 Kurtosis: 25.708 Cond. No. 92.4 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now, let's switch to the AutoReg() method, which we imported above. The only real difference here is that you do not need to write out a string with the model specification: the AutoReg() method knows that the model is AR(p). NB: you can include other regressors using the exog argument.
# Need to set the data frequency.
usa = usa.resample('q').mean()
# Estimate the model.
mod = AutoReg(usa['gdp_diff'].dropna(), 2)
res = mod.fit()
print(res.summary())
AutoReg Model Results ============================================================================== Dep. Variable: gdp_diff No. Observations: 301 Model: AutoReg(2) Log Likelihood 914.883 Method: Conditional MLE S.D. of innovations 0.011 Date: Fri, 14 Oct 2022 AIC -1821.765 Time: 07:06:30 BIC -1806.964 Sample: 12-31-1947 HQIC -1815.841 - 06-30-2022 =============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------- const 0.0060 0.001 6.821 0.000 0.004 0.008 gdp_diff.L1 0.1010 0.057 1.758 0.079 -0.012 0.214 gdp_diff.L2 0.1146 0.057 1.995 0.046 0.002 0.227 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- AR.1 2.5458 +0.0000j 2.5458 0.0000 AR.2 -3.4267 +0.0000j 3.4267 0.5000 -----------------------------------------------------------------------------
We see the same coefficient estimates as the standard statsmodels approach.
Comments: The stuff at the bottom is outside the scope of this class, but in case you're curious... the "Roots" table reports the roots of the autoregressive lag polynomial, and the modulus is their absolute value. These govern whether a shock dies out and the process returns to its long-run behavior: roots with modulus outside the unit circle (ie, absolute value greater than one) imply such stability. Since we observe US business cycles fluctuating around a trend, a "smell" test for whether a model makes sense is whether the estimated roots imply stability. Of course, since the data reflect this stability, such an outcome should be expected.
The results object has lots of stuff in it. Try tab completion to see what's in there. Let's plot the fitted values:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(res.fittedvalues, color='blue', label = 'AR(2)')
ax.plot(usa.index, usa.gdp_diff, color='black', alpha = 0.5, label = 'Data')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('US GDP growth rates',size=28)
ax.yaxis.set_major_formatter(FormatStrFormatter('%0.3f'))
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
ax.legend(frameon=False,fontsize=18)
plt.show()
The fitted values track the data, but do not generate enough volatility.
Forecasting
We can use the estimated relationship to build a forecast: for each horizon, regress the future value of GDP growth on today's lagged growth rates, then plug in the most recent values of those regressors to produce the forecast.
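As an aside, the AutoReg results object fitted above can produce out-of-sample forecasts directly; a minimal sketch (assuming res still holds the AR(2) fit from the previous subsection):
y = usa['gdp_diff'].dropna() # the series the AR(2) model was fit on
print(res.predict(start=len(y), end=len(y)+8)) # dynamic out-of-sample forecasts for the next 9 quarters
The loop below builds a forecast by hand, one horizon at a time, so you can see the mechanics.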
# To make plotting easier, initialize forecast_list to current value of gdp.
# This will simply connect the lines for the historical data and the forecast.
forecast_list = []
forecast_list.append(float(usa.loc['2022-06-30','gdp_diff']))
usa['L1gdp_diff'] = usa["gdp_diff"].shift(1)
usa['L2gdp_diff'] = usa["gdp_diff"].shift(2)
Xvars = ['L1gdp_diff','L2gdp_diff']
Xvars_string = ' + '.join(Xvars)
# Series containing current values of the explanatory variables. Used to make forecast
current_values = usa.loc['2022-06-30',Xvars]
# Create new columns to contain forward values of the GDP growth rate:
for h in range(1,10):
var = 'F' + str(h) + 'gdp_diff' # Create string containing the variable name, like "F1gdp"
usa[var] = usa['gdp_diff'].shift(-h) # Create column in data frame
# Loop through each forecast horizon:
for h in range(1,10):
    # Define the dependent variable
var = 'F' + str(h) + 'gdp_diff'
# Use same trick as above to conveniently write the expression for the model:
res = smf.ols(var + ' ~ ' + Xvars_string, data=usa).fit()
# Print results (if necessary)
if h == 1:
print(res.summary()) # Print results summary for first regression only
# Record forecast
pt_forecast = float(res.predict(current_values)) # The .predict() method creates the forecast of GDP.
forecast_list.append(pt_forecast) # Append the forecast for the current horizon to the list of forecasts
print('')
print(forecast_list)
OLS Regression Results ============================================================================== Dep. Variable: F1gdp_diff R-squared: 0.017 Model: OLS Adj. R-squared: 0.010 Method: Least Squares F-statistic: 2.522 Date: Fri, 14 Oct 2022 Prob (F-statistic): 0.0820 Time: 07:06:33 Log-Likelihood: 910.16 No. Observations: 298 AIC: -1814. Df Residuals: 295 BIC: -1803. Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 0.0067 0.001 7.589 0.000 0.005 0.008 L1gdp_diff 0.1306 0.058 2.246 0.025 0.016 0.245 L2gdp_diff -0.0179 0.058 -0.308 0.758 -0.132 0.097 ============================================================================== Omnibus: 141.061 Durbin-Watson: 1.791 Prob(Omnibus): 0.000 Jarque-Bera (JB): 5274.083 Skew: -1.209 Prob(JB): 0.00 Kurtosis: 23.467 Cond. No. 92.5 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [-0.0014473866859976425, 0.005910242683759876, 0.006888434724432077, 0.007716174337970049, 0.007819723553806616, 0.008022337924499738, 0.00832140520106706, 0.008631598715385247, 0.007503914856584633, 0.006786637591904043]
forecast_dates = pd.date_range('2022-06-30', periods=10, freq='Q')
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(res.fittedvalues, color='blue', label = 'AR(2)')
ax.plot(usa.index, usa.gdp_diff, color='black', alpha = 0.5, label = 'Data')
ax.plot(forecast_dates, forecast_list, color = 'red', linestyle = '-', label = 'Forecast')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('US GDP growth rates',size=28)
ax.yaxis.set_major_formatter(FormatStrFormatter('%0.3f'))
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
ax.legend(frameon=False,fontsize=18)
plt.show()
5. Autoregressive Integrated Moving-Average (ARIMA)
Let's get serious about modeling GDP growth and fit an Autoregressive Integrated Moving-Average (ARIMA) model. The "integrated" part refers to differencing the data to make them stationary, but we've already done that (we're working with growth rates), so what we'll actually be modelling is an Autoregressive Moving-Average (ARMA) model, defined as
$$ Y_t = \varepsilon_t + \underbrace{\sum_{i=1}^p \varphi_i Y_{t-i}}_{\text{Auto-regressive}} + \underbrace{\sum_{i=1}^q \theta_{i}\varepsilon_{t-i}}_{\text{Moving-Average}}. $$
We need to choose the orders $p$ and $q$ and then estimate (fit) $\{\varphi_i\}_{i=1}^p$ and $\{\theta_i\}_{i=1}^q$. Fortunately, there's a recipe for choosing $(p,q)$:
- Plot the autocorrelation functions for an estimate of q. We've already done this above and found the autocorrelation was significant out to a lag of two, so q=2. We repeat the process and graph below.
fig, ax = plt.subplots(figsize=(10,6))
# plot_acf is also picky about nans, so drop them
# The 'lags' parameter determines how many lags to show
tsaplots.plot_acf(usa['gdp_diff'].dropna(), lags = 10, ax = ax)
ax.set_xlabel('lag number', fontsize=14)
plt.xticks(np.arange(0, 11, 1.0))
ax.set_ylabel(r'corr(gdp$_t$, gdp$_{t-k})$', fontsize=14)
sea.despine(ax=ax)
plt.show()
- Plot the partial autocorrelation functions for an estimate of p. This is beyond what you need to know for this class but in case you're interested:
Given a time series $\{z_{t}\}$, the partial autocorrelation of lag k, denoted $\phi_{k,k}$, is the autocorrelation between $z_{t}$ and $z_{t+k}$ with the linear dependence of $z_{t}$ on $z_{t+1}$ through $z_{t+k-1}$ removed. In a slightly less mathy tone, the partial autocorrelation of lag k is the autocorrelation between $z_{t}$ and $z_{t+k}$ not accounted for by lags $1$ through (and including) $k-1$. For us, it will serve as a tool to identify how many lags to include for the AR component.
We haven't done this yet so let's do it.
fig, ax = plt.subplots(figsize=(10,6))
# plot_pacf is also picky about nans, so drop them
# The 'lags' parameter determines how many lags to show
tsaplots.plot_pacf(usa['gdp_diff'].dropna(), lags = 10, ax = ax,method='ywm')
ax.set_xlabel('lag number', fontsize=14)
plt.xticks(np.arange(0, 11, 1.0))
ax.set_ylabel(r'partial corr(gdp$_t$, gdp$_{t-k})$', fontsize=14)
sea.despine(ax=ax)
plt.show()
Hmm... very similar to the autocorrelation function, and the number of significant lags is again two, so set p=2. And now we can fit the model:
import statsmodels.api as sm
arima_model = sm.tsa.arima.ARIMA(usa['gdp_diff'], order=(2,0,2)) # The "zero" in "order" is the number of differences to apply,
                                                                 # but we've already differenced the data (growth rates), so I'm
                                                                 # telling the algorithm to not apply a difference.
res = arima_model.fit()
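To see the estimated ARMA coefficients (and the AIC/BIC), you can print the usual summary:
print(res.summary()) # coefficient estimates, standard errors, and information criteria for the ARIMA(2,0,2) fit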
A nice feature of statsmodels is that it's easy to use the fitted model to make a forecast, so let's do that:
# Forecast
fc = res.forecast(15, alpha=0.05) # 95% conf
# Make as pandas series
fc_series = pd.DataFrame(fc)
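If you also want confidence bands around the forecast, a minimal sketch using get_forecast, which returns an object carrying both the predicted mean and the intervals (assuming a recent statsmodels version):
fc_obj = res.get_forecast(15) # forecast object for the next 15 quarters
fc_ci = fc_obj.conf_int(alpha=0.05) # 95% confidence interval for each horizon
print(fc_ci.head())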
And of course it would be useful to plot the results:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(res.fittedvalues, color='blue', label = 'ARIMA')
ax.plot(usa.index, usa.gdp_diff, color='black', alpha = 0.5, label = 'Data')
ax.plot(fc_series.index, fc_series.predicted_mean, color='red', label = 'Forecast')
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('US GDP growth rates',size=28)
ax.yaxis.set_major_formatter(FormatStrFormatter('%0.3f'))
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
ax.legend(frameon=False,fontsize=18)
plt.show()
Still not enough volatility captured in our model.
Practice
- Collect historical data on the S&P 500 price index from FRED using the pandas-datareader API. The FRED variable code is 'SP500'. See the API notebook for how to do this. Use November 23, 2013 as the start date.
codes = ['SP500'] # The FRED code for the S&P 500 price index.
start = dt.datetime(2013, 11, 23)
stocks = data.DataReader(codes, 'fred', start)
- Resample the data to weekly frequency using mean().
stocks = stocks.resample('w').mean()
- Plot the weekly average price. Are the data stationary?
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(stocks.index, stocks['SP500'], color = 'blue')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('')
ax.set_title('S&P 500 Index Price',size=28)
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
- Compute the log difference of the weekly price data.
#stocks['sp500_diff'] = np.log(stocks['SP500']) - (np.log(stocks['SP500'].shift(1))) # Approximation using logs
stocks['sp500_diff'] = stocks['SP500'].pct_change() # Calculated directly
stocks.head()
| DATE | SP500 | sp500_diff |
|---|---|---|
| 2013-12-01 | 1804.5675 | NaN |
| 2013-12-08 | 1795.7960 | -0.004861 |
| 2013-12-15 | 1788.8060 | -0.003892 |
| 2013-12-22 | 1801.2220 | 0.006941 |
| 2013-12-29 | 1836.1825 | 0.019409 |
- Plot the growth rates. Are the data stationary?
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(stocks.index, stocks['sp500_diff'], color = 'blue')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('')
ax.set_title('S&P 500 Index Growth Rate',size=28)
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
- Plot the autocorrelation function for 20 lags. What lags look meaningful?
fig, ax = plt.subplots(figsize=(10,6))
# plot_acf is picky about nas, so drop them
tsaplots.plot_acf(stocks['sp500_diff'].dropna(), lags = 20, ax = ax)
ax.set_xlabel('lag = k', fontsize=14)
ax.set_ylabel(r'corr(sp500$_t$, sp500$_{t-k})$', fontsize=14)
sea.despine(ax=ax)
plt.show()
- Estimate an AR(p) model of the price data. Include 1 lag.
stock_res_1 = AutoReg(stocks['sp500_diff'].dropna(),1).fit()
print(stock_res_1.summary())
AutoReg Model Results ============================================================================== Dep. Variable: sp500_diff No. Observations: 463 Model: AutoReg(1) Log Likelihood 1206.886 Method: Conditional MLE S.D. of innovations 0.018 Date: Fri, 14 Oct 2022 AIC -2407.773 Time: 07:06:46 BIC -2395.366 Sample: 12-15-2013 HQIC -2402.888 - 10-16-2022 ================================================================================= coef std err z P>|z| [0.025 0.975] --------------------------------------------------------------------------------- const 0.0013 0.001 1.531 0.126 -0.000 0.003 sp500_diff.L1 0.2377 0.045 5.241 0.000 0.149 0.327 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- AR.1 4.2072 +0.0000j 4.2072 0.0000 -----------------------------------------------------------------------------
- Plot the data and the fitted values.
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(stock_res_1.fittedvalues, color='blue', label = 'fitted values')
ax.plot(stocks.index, stocks['sp500_diff'], color='black', alpha = 0.5, label = 'data')
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('S&P500 growth rates',size=28)
ax.legend(frameon=False,fontsize=18)
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
- Plot the data and the fitted values from the 1-lag model and the 2-lag model. Plot just for the period 1/1/2015 through 1/1/2017.
stock_res_2 = AutoReg(stocks['sp500_diff'].dropna(),2).fit()
print(stock_res_2.summary())
AutoReg Model Results ============================================================================== Dep. Variable: sp500_diff No. Observations: 463 Model: AutoReg(2) Log Likelihood 1205.415 Method: Conditional MLE S.D. of innovations 0.018 Date: Fri, 14 Oct 2022 AIC -2402.830 Time: 07:06:47 BIC -2386.297 Sample: 12-22-2013 HQIC -2396.320 - 10-16-2022 ================================================================================= coef std err z P>|z| [0.025 0.975] --------------------------------------------------------------------------------- const 0.0014 0.001 1.669 0.095 -0.000 0.003 sp500_diff.L1 0.2578 0.047 5.529 0.000 0.166 0.349 sp500_diff.L2 -0.0841 0.047 -1.800 0.072 -0.176 0.007 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- AR.1 1.5339 -3.0894j 3.4493 -0.1767 AR.2 1.5339 +3.0894j 3.4493 0.1767 -----------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(stock_res_2.fittedvalues, color='blue', label = 'fitted values, p = 2')
ax.plot(stock_res_1.fittedvalues, color='red', label = 'fitted values, p = 1')
ax.plot(stocks.index, stocks['sp500_diff'], color='black', alpha = 0.5, label = 'data')
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('S&P500 growth rates (2015-2017)',size=28)
ax.legend(frameon=False,fontsize=18)
ax.set_xlim([dt.datetime(2015, 1, 1), dt.datetime(2017, 1, 1)])
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()