Lecture 19: Web APIs to Access Data [MY SOLUTIONS]
We have been loading data from files using read_csv() and read_excel(). A second way to get data into python/pandas is to download it directly from a web server through an application programming interface, or API.
The Wikipedia page isn't that insightful, but an API is a way to directly query a web server and (in our case) ask for data. An API provides several advantages:
- You only download the data you need
- You do not need to distribute data files with your code
- You have access to the 'freshest data'
There are downsides to using APIs, too.
- You need to be online to retrieve the data
- The group hosting the data may 'revise' the data, making it difficult to replicate your results
- The api may change, breaking your code.
You can think of an API as a way of sending and receiving messages over the internet. As such, it sends and receives Hypertext Transfer Protocol (HTTP) request messages just like your web browser. For instance, if you send a bad request to an API (e.g., suppose the endpoint does not exist), you get a '404' error message just like a browser. A successful web browser call returns the HTML code your browser renders. A successful data API call returns a different kind of structured response message, usually in Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format. Both are formats for communicating data efficiently, and both resemble dicts.
I mention this because up until now you were probably aware of the .csv and .xlsx data formats, but these are not very good formats for transferring data because they carry a lot of useless information. .xml and .json are extremely common (as is .txt) because there's less superfluous stuff, so data transfer is more efficient. Note that there is a .read_json pandas method, so you can load json files into pandas directly. Accessing a url via read_json is also possible but not advisable.
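To see why JSON resembles a dict, here is a minimal sketch that parses a made-up JSON payload with Python's built-in json module (the series name and values are invented for illustration):

```python
import json

# A made-up JSON payload, as a string, like an API might return
payload = '{"series": "GDPCA", "observations": [{"date": "1970-01-01", "value": 4954.436}]}'

data = json.loads(payload)               # parse the string into Python objects
print(type(data).__name__)               # JSON objects map to dicts
print(data['observations'][0]['value'])  # nested lists/dicts index naturally
```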
On the whole, I find APIs convenient, useful, and powerful. Let's dig in.
Class Announcements¶
I'll post Exam 2 on eLC on Sunday, and it's due by end of day Thursday, November 3, 2022
1. Pandas Datareader: The Simple Approach (top)¶
We'll start with the (kind of) built-in web API package: pandas_datareader, which collects functions that interact with the APIs of several popular data sources. Today, we'll cover
- Yahoo! Finance
- St. Louis Fed's FRED
You'll likely have to install datareader:
pip install pandas-datareader
import pandas as pd # pandas, shortened to pd
import numpy as np
# If you receive an error while trying to load data_reader try uncommenting the line below
# This is/was a problem with older version of pandas_datareader
# pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data, wb # we are grabbing the data and wb functions from the package
import matplotlib.pyplot as plt # for plotting
import seaborn as sns
import datetime as dt # for time and date
API Keys¶
Many data providers do not want anonymous accounts connecting to the API and downloading data. These providers ask you to create an account, and you are given an API key that you pass along with your request. Often keys are free; sometimes they are not. If you use an API that requires a key, be careful not to make your key public.
In this notebook, we will go through a few examples that do not require API keys.
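If you do use a keyed API, the key typically travels as a query parameter (or header) on the request. Here is a minimal sketch with requests; the endpoint and the api_key parameter name are hypothetical, so check your provider's documentation for the real ones:

```python
import requests

# Hypothetical endpoint and parameter name -- check your provider's docs.
# In practice, read the key from an environment variable; never commit it to code.
api_key = 'demo-key'

req = requests.Request(
    'GET',
    'https://api.example.com/v1/data',
    params={'series': 'GDPCA', 'api_key': api_key},  # key rides along as a query parameter
).prepare()   # build the request without actually sending it
print(req.url)
```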
FRED¶
We've used the FRED database before. It's hosted by the St. Louis FRB and contains a lot of economic as well as financial data. It is US-centric but has some international data, too.
To use the FRED API you need to know the variable codes. The easiest way to find them is to search on the FRED website.
The pandas_datareader documentation for FRED is here.
codes = ['GDPCA', 'LFWA64TTUSA647N'] # these codes are for real US gdp and the working age population
# the first code seems intuitive. the second does not
# We have the codes. Now go get the data. The DataReader() function returns a DataFrame
# Create a datetime object for the start date. If you do not specify an end date, it returns data up to the most
# recent date
start = dt.datetime(1970, 1, 1)
fred = data.DataReader(codes, 'fred', start)
fred.head()
| DATE | GDPCA | LFWA64TTUSA647N |
|---|---|---|
| 1970-01-01 | 4954.436 | 1.180775e+08 |
| 1971-01-01 | 5117.603 | 1.208098e+08 |
| 1972-01-01 | 5386.733 | 1.241022e+08 |
| 1973-01-01 | 5690.853 | 1.267081e+08 |
| 1974-01-01 | 5660.091 | 1.291758e+08 |
fred.columns = ['gdp', 'wap'] # give the variables some reasonable names
# Let's plot real gdp per working age person
fred['gdp_wap'] = fred['gdp']*1000000000/fred['wap'] # gdp data is in billions
fred.head()
| DATE | gdp | wap | gdp_wap |
|---|---|---|---|
| 1970-01-01 | 4954.436 | 1.180775e+08 | 41959.187822 |
| 1971-01-01 | 5117.603 | 1.208098e+08 | 42360.844220 |
| 1972-01-01 | 5386.733 | 1.241022e+08 | 43405.632187 |
| 1973-01-01 | 5690.853 | 1.267081e+08 | 44913.101440 |
| 1974-01-01 | 5660.091 | 1.291758e+08 | 43816.978032 |
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(fred.index, fred['gdp_wap'], color='red')
ax.set_ylabel('2012 dollars')
ax.set_title('U.S. real GDP per working-age person')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Practice:¶
Take a few minutes and try the following. Feel free to chat with those around if you get stuck.
How has inflation in the United States evolved over the last 60 years? Let's investigate.
- Go the FRED website and find the code for the 'Consumer price index for all urban consumers: All items less food and energy'
- Use the api to get the data from 1960 to the most recent.
start = dt.datetime(1960, 1, 1)
fred = data.DataReader('CPILFESL', 'fred', start)
fred.head()
| DATE | CPILFESL |
|---|---|
| 1960-01-01 | 30.5 |
| 1960-02-01 | 30.6 |
| 1960-03-01 | 30.6 |
| 1960-04-01 | 30.6 |
| 1960-05-01 | 30.6 |
- Create a variable in your DataFrame that contains the growth rate of the CPI --- the inflation rate. Compute the growth rate in percentage terms.
fred['inflation'] = fred['CPILFESL'].pct_change()*100
fred.head()
| DATE | CPILFESL | inflation |
|---|---|---|
| 1960-01-01 | 30.5 | NaN |
| 1960-02-01 | 30.6 | 0.327869 |
| 1960-03-01 | 30.6 | 0.000000 |
| 1960-04-01 | 30.6 | 0.000000 |
| 1960-05-01 | 30.6 | 0.000000 |
- Plot it. What patterns do you see?
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(fred.index, fred['inflation'], color='blue')
ax.set_ylabel('percent')
ax.set_title('U.S. CPI inflation')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- We computed the month-to-month inflation rate above. This is not the inflation rate we usually care about. Can you compute and plot the year-over-year inflation rate? For example, the inflation rate for 1962-05-01 would be the cpi in 1962-05-01 divided by the cpi in 1961-05-01. [Hint: You could do this with .resample(), but you should also check the documentation for .pct_change().]
fred['infl_year'] = fred['CPILFESL'].pct_change(periods=12)*100
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(fred.index, fred['infl_year'], color='blue')
ax.set_ylabel('percent')
ax.set_title('U.S. CPI inflation')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Add the following annotations:
- Label the decrease in inflation around 1983 the 'Volcker disinflation'.
- Label the recent increase in inflation 'Covid inflation'.
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(fred.index, fred['infl_year'], color='blue')
ax.set_ylabel('percent')
ax.set_title('U.S. CPI inflation')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.text(dt.datetime(1983, 12,1), 12, 'Volcker disinflation', ha='left')
ax.axvspan(dt.datetime(1980, 6,1),dt.datetime(1983, 10,1),color='red',alpha=.2)
ax.text(dt.datetime(2019, 12,1), 12, 'Covid inflation',ha='right')
ax.axvspan(dt.datetime(2020, 1,1),dt.datetime(2022, 10,1),color='red',alpha=.2)
plt.show()
2. Accessing APIs via Requests (top)¶
While pandas_datareader is convenient, there are a lot of web servers equipped with APIs that are not included in the package. The more general approach to accessing web servers is to use the requests package -- the package we used last time for web scraping.
2.1 COVID19 Data¶
The COVID Tracking Project has a data API. Let's grab their data:
import requests # api module
url = 'https://api.covidtracking.com/v1/states/daily.json'
response = requests.get(url)
# Check out the HTML return
if response.status_code == 200:
    print('Download successful')
    covid = pd.read_json(response.text) # convert to dataframe
elif response.status_code == 301:
    print('The server redirected to a different endpoint.')
elif response.status_code == 400:
    print('Bad request.')
elif response.status_code == 401:
    print('Authentication required.')
elif response.status_code == 403:
    print('Access denied.')
elif response.status_code == 404:
    print('Resource not found.')
else:
    print('Server busy.')
# Clean-up data
covid['date'] = pd.to_datetime(covid['date'], format='%Y%m%d')
covid['month'] = covid['date'].dt.strftime('%B %Y')
covid.drop(covid[covid['date'] < '2020-03-01'].index, inplace=True) # Drop January and February, when there were few cases
Download successful
Let's take a look:
covid.head()
|   | date | state | positive | probableCases | negative | pending | totalTestResultsSource | totalTestResults | hospitalizedCurrently | hospitalizedCumulative | ... | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-03-07 | AK | 56886.0 | NaN | NaN | NaN | totalTestsViral | 1731628.0 | 33.0 | 1293.0 | ... | 0 | 0 | dc4bccd4bb885349d7e94d6fed058e285d4be164 | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
| 1 | 2021-03-07 | AL | 499819.0 | 107742.0 | 1931711.0 | NaN | totalTestsPeopleViral | 2323788.0 | 494.0 | 45976.0 | ... | -1 | 0 | 997207b430824ea40b8eb8506c19a93e07bc972e | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
| 2 | 2021-03-07 | AR | 324818.0 | 69092.0 | 2480716.0 | NaN | totalTestsViral | 2736442.0 | 335.0 | 14926.0 | ... | 22 | 11 | 50921aeefba3e30d31623aa495b47fb2ecc72fae | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
| 3 | 2021-03-07 | AS | 0.0 | NaN | 2140.0 | NaN | totalTestsViral | 2140.0 | NaN | NaN | ... | 0 | 0 | f77912d0b80d579fbb6202fa1a90554fc4dc1443 | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
| 4 | 2021-03-07 | AZ | 826454.0 | 56519.0 | 3073010.0 | NaN | totalTestsViral | 7908105.0 | 963.0 | 57907.0 | ... | 5 | 44 | 0437a7a96f4471666f775e63e86923eb5cbd8cdf | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
5 rows × 57 columns
Let's plot the time-series for several states to look for trends:
import matplotlib.pyplot as plt
subset = ['CA','GA','TX','ND','SD']
# Line graph
fig, ax = plt.subplots(figsize=(15,10))
# Figure 1: Raw Case Counts
for state in subset:
    dff = covid[covid['state']==state]
    ax.plot(
        dff.date,
        dff['positive'],
        label = state
    )
ax.set_ylabel('Total Number of Cases') # add the y-axis label
ax.set_title('COVID-19 Cases by Select States Over Time')
ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False) # get rid of the line on top
ax.legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax.axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\2769691991.py:25: UserWarning: FixedFormatter should only be used together with FixedLocator ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
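The FixedFormatter warning above comes from setting tick labels without pinning the tick locations. A tick formatter produces the same comma-separated labels without the warning; a minimal sketch with made-up data:

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot([0, 1, 2], [0, 500_000, 1_000_000])                   # made-up data
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))  # commas on the y-axis, no warning
plt.close(fig)   # closed here; in the notebook you would call plt.show() instead
```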
Practice¶
- Pick a different set of states and plot them.
import matplotlib.pyplot as plt
subset = ['WI','NY','FL']
# Line graph
fig, ax = plt.subplots(figsize=(15,10))
# Figure 1: Raw Case Counts
for state in subset:
    dff = covid[covid['state']==state]
    ax.plot(
        dff.date,
        dff['positive'],
        label = state
    )
ax.set_ylabel('Total Number of Cases') # add the y-axis label
ax.set_title('COVID-19 Cases by Select States Over Time')
ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False) # get rid of the line on top
ax.legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax.axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\1063602651.py:25: UserWarning: FixedFormatter should only be used together with FixedLocator ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
- Create a new variable called 'new cases' using the .diff() method. Note that .diff() works just like .pct_change() and .shift(), which we've used before.
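As a quick reminder of how these three methods behave, here is a toy Series:

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 121.0])
print(s.diff().tolist())        # change from the prior row: [nan, 10.0, 11.0]
print(s.shift().tolist())       # the prior row's value: [nan, 100.0, 110.0]
print(s.pct_change().tolist())  # percent change from the prior row (roughly 10% each step here)
```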
covid = covid.sort_values(by=['state','date'])
covid.reset_index(inplace=True)
covid = covid[['date','state','positive']]
covid['new cases'] = covid.groupby('state')['positive'].diff().fillna(0).reset_index(0,drop=True)
covid['7-day rolling avg'] = covid.groupby('state')['new cases'].rolling(7).mean().fillna(0).reset_index(0,drop=True)
covid['14-day rolling avg'] = covid.groupby('state')['new cases'].rolling(14).mean().fillna(0).reset_index(0,drop=True)
covid[covid['state']=='GA'].head(10)
|   | date | state | positive | new cases | 7-day rolling avg | 14-day rolling avg |
|---|---|---|---|---|---|---|
| 4043 | 2020-03-04 | GA | 2.0 | 0.0 | 0.000000 | 0.0 |
| 4044 | 2020-03-05 | GA | 2.0 | 0.0 | 0.000000 | 0.0 |
| 4045 | 2020-03-06 | GA | 2.0 | 0.0 | 0.000000 | 0.0 |
| 4046 | 2020-03-07 | GA | 6.0 | 4.0 | 0.000000 | 0.0 |
| 4047 | 2020-03-08 | GA | 7.0 | 1.0 | 0.000000 | 0.0 |
| 4048 | 2020-03-09 | GA | 12.0 | 5.0 | 0.000000 | 0.0 |
| 4049 | 2020-03-10 | GA | 17.0 | 5.0 | 2.142857 | 0.0 |
| 4050 | 2020-03-11 | GA | 22.0 | 5.0 | 2.857143 | 0.0 |
| 4051 | 2020-03-12 | GA | 31.0 | 9.0 | 4.142857 | 0.0 |
| 4052 | 2020-03-13 | GA | 42.0 | 11.0 | 5.714286 | 0.0 |
- Create a 2-x-1 figure where the top figure is total cases and the bottom is new cases.
import matplotlib.pyplot as plt
subset = ['CA','GA','TX','ND','SD']
# Line graph
fig, ax = plt.subplots(2,1,figsize=(15,10))
# Figure 1: Raw Case Counts
for state in subset:
    dff = covid[covid['state']==state]
    ax[0].plot(
        dff.date,
        dff['positive'],
        label = state
    )
ax[0].set_ylabel('Total Number of Cases') # add the y-axis label
ax[0].set_title('COVID-19 Cases by Select States Over Time')
ax[0].spines['right'].set_visible(False) # get rid of the line on the right
ax[0].spines['top'].set_visible(False) # get rid of the line on top
ax[0].legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax[0].set_yticklabels(['{:,}'.format(int(x)) for x in ax[0].get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax[0].axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
# Figure 2: New cases (7-day rolling average)
for state in subset:
    dff = covid[covid['state']==state]
    ax[1].plot(
        dff.date,
        dff['7-day rolling avg'],
        label = state
    )
ax[1].set_ylabel('New COVID-19 Cases (7-day avg)') # add the y-axis label
ax[1].set_title('New COVID-19 Cases by Select States Over Time (7-day Avg)')
ax[1].spines['right'].set_visible(False) # get rid of the line on the right
ax[1].spines['top'].set_visible(False) # get rid of the line on top
ax[1].legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax[1].set_yticklabels(['{:,}'.format(int(x)) for x in ax[1].get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax[1].axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
plt.show()
C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\569824982.py:25: UserWarning: FixedFormatter should only be used together with FixedLocator ax[0].set_yticklabels(['{:,}'.format(int(x)) for x in ax[0].get_yticks().tolist()]) C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\569824982.py:48: UserWarning: FixedFormatter should only be used together with FixedLocator ax[1].set_yticklabels(['{:,}'.format(int(x)) for x in ax[1].get_yticks().tolist()])
- Download population data from the following api
https://api.census.gov/data/2019/pep/population?get=NAME,POP,DENSITY&for=state:*
url = 'https://api.census.gov/data/2019/pep/population?get=NAME,POP,DENSITY&for=state:*'
response = requests.get(url)
population = pd.read_json(response.text) # convert to dataframe
population.rename(columns=population.iloc[0],inplace=True)
population.drop(0,inplace=True)
population.drop(['state'], axis=1,inplace=True)
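The Census endpoint returns JSON as a list of rows in which the first row is the header, which is why the code above promotes row 0 to column names. The same idea on a made-up two-row payload (the figures here are illustrative, not official Census values):

```python
import pandas as pd

# Hypothetical rows in the Census shape: the header row comes first, then the data
rows = [['NAME', 'POP', 'DENSITY', 'state'],
        ['Georgia', '10617423', '184.6', '13']]

demo = pd.DataFrame(rows[1:], columns=rows[0])  # first row becomes the header
print(demo.columns.tolist())                    # named columns, data rows only
```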
- Add the two digit postal codes using 'state_abbrev.csv' (in the class data folder) to the population data.
abbrev = pd.read_csv('./Data/state_abbrev.csv',header=None) # convert to dataframe
abbrev.rename(columns={0:'NAME',1:'state'},inplace=True)
population = population.merge(abbrev,on='NAME',how='outer')
population.drop(['NAME'], axis=1,inplace=True)
population['POP'] = population.POP.astype(float)
population['DENSITY'] = population.DENSITY.astype(float)
- Add the population data to the covid df and create a variable of new cases per 10k residents. Repeat (3) using this per capita measure.
# Add population information
covid = covid.merge(population,on='state',how='inner')
covid['positive pc'] = 1e+4*covid['positive']/covid['POP']
covid['new cases pc'] = 1e+4*covid['new cases']/covid['POP']
covid['7-day rolling avg pc'] = 1e+4*covid['7-day rolling avg']/covid['POP']
covid.head()
|   | date | state | positive | new cases | 7-day rolling avg | 14-day rolling avg | POP | DENSITY | positive pc | new cases pc | 7-day rolling avg pc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-03-06 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
| 1 | 2020-03-07 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
| 2 | 2020-03-08 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
| 3 | 2020-03-09 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
| 4 | 2020-03-10 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
import matplotlib.pyplot as plt
subset = ['CA','GA','TX','ND','SD']
fig, ax = plt.subplots(2,1,figsize=(15,10))
for state in subset:
    dff = covid[covid['state']==state]
    ax[0].plot(
        dff.date,
        dff['positive pc'],
        label = state
    )
ax[0].set_ylabel('Total Number of Cases per 10K Residents') # add the y-axis label
ax[0].set_title('COVID-19 Cases per 10K Residents by Select States Over Time')
ax[0].spines['right'].set_visible(False) # get rid of the line on the right
ax[0].spines['top'].set_visible(False) # get rid of the line on top
ax[0].legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax[0].set_yticklabels(['{:,}'.format(int(x)) for x in ax[0].get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax[0].axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
for state in subset:
    dff = covid[covid['state']==state]
    ax[1].plot(
        dff.date,
        dff['7-day rolling avg pc'],
        label = state
    )
ax[1].set_ylabel('New COVID-19 Cases per 10K Residents (7-day avg)') # add the y-axis label
ax[1].set_title('New COVID-19 Cases per 10K Residents by Select States Over Time (7-day Avg)')
ax[1].spines['right'].set_visible(False) # get rid of the line on the right
ax[1].spines['top'].set_visible(False) # get rid of the line on top
ax[1].legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax[1].set_yticklabels(['{:,}'.format(int(x)) for x in ax[1].get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax[1].axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
plt.show()
C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\1046391063.py:25: UserWarning: FixedFormatter should only be used together with FixedLocator ax[0].set_yticklabels(['{:,}'.format(int(x)) for x in ax[0].get_yticks().tolist()]) C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\1046391063.py:48: UserWarning: FixedFormatter should only be used together with FixedLocator ax[1].set_yticklabels(['{:,}'.format(int(x)) for x in ax[1].get_yticks().tolist()])
2.2 S&P 500 Stock Prices¶
In the code below we access the Wikipedia page that lists the S&P 500 company tickers (link). We use requests to grab the html code sent to our computer, use the python package "Beautiful Soup" (bs4) to parse the html, and grab the tickers. I wrote this as a function, but frankly that's just to remind you how user-defined functions work.
import bs4 as bs
# Scrape sp500 tickers
def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'html.parser')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        if '.' not in ticker:
            tickers.append(ticker.replace('\n', ''))
    return tickers
tickers = save_sp500_tickers()
Now use the Yahoo! Finance API to download the stock history for each ticker in the S&P 500. NB, this API isn't actually maintained by Yahoo! anymore, and requesting data prior to 1970 won't work. You'll have to install the package:
pip install yfinance
import yfinance as yf
prices = yf.download(tickers, start='2018-01-01')['Adj Close']
prices.to_csv('stock_data.csv') # keep the date index so the saved data stays usable
[*********************100%***********************] 501 of 501 completed
Generate Log-Returns¶
Define "log-returns" as the difference in log-value of an asset over the time interval (t-1, t): $$ r_{it}= \log\bigg(\frac{p_t}{p_{t-1}}\bigg)=\log(p_t)-\log(p_{t-1}) $$ Recall that the log difference approximates the percentage change. Log-returns are useful in quantitative finance for a number of reasons (stationarity, log-normality, etc.) and they're time-additive, so computing cumulative returns (i.e., past performance) is easy. **Note: We'll see this math again when we do Principal Component Analysis (PCA) later in the semester to estimate a Capital Asset Pricing Model (CAPM).**
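Time-additivity is easy to check numerically: summing the daily log-returns recovers the log of the cumulative gross return. A small sketch with a made-up price path:

```python
import numpy as np

p = np.array([100.0, 105.0, 103.0, 110.0])   # a made-up price path
r = np.diff(np.log(p))                        # daily log-returns
# The sum telescopes: log(p1/p0) + log(p2/p1) + ... = log(p_last/p_first)
print(np.isclose(r.sum(), np.log(p[-1] / p[0])))
```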
Time to generate log-returns for the S&P500 stocks.
rs = prices.apply(np.log).diff(1)
Visualization¶
Let's generate a plot of log returns over time for the S&P 500 stocks. There will be a lot of lines, and the end product will look like a Jackson Pollock painting.
fig, ax = plt.subplots(figsize=(15,10))
rs.plot(ax=ax,legend=False) # We can use pandas for plotting too!
ax.axhline(y=0,linestyle='--',color='gray')
ax.set_ylabel('')
ax.set_xlabel('')
ax.set_title('Daily Returns of Companies in the S&P 500',fontsize=24,pad=20)
ax.set_facecolor('white')
plt.xticks(rotation = 0, size=18)
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
sns.despine(ax=ax)
plt.show()