Lecture 19: Web APIs to Access Data [MY SOLUTIONS]
We have been loading data from files using read_csv() and read_excel(). A second way to get data into python/pandas is to download it directly from a web server through an application programming interface, or API.
The Wikipedia page isn't that insightful, but an API is a way to directly query a web server and (in our case) ask for data. An API provides several advantages:
- You only download the data you need
- You do not need to distribute data files with your code
- You have access to the 'freshest data'
There are downsides to using APIs, too.
- You need to be online to retrieve the data
- The group hosting the data may 'revise' the data, making it difficult to replicate your results
- The api may change, breaking your code.
You can think of an API as a way of sending and receiving messages over the internet. As such, it sends and receives Hypertext Transfer Protocol (HTTP) request messages just like your web browser. For instance, if you send a bad request to an API (e.g., suppose the endpoint does not exist), you get a '404' error message just like a browser. A successful web browser call returns the HTML code your browser renders. A successful data API call returns a different kind of structured response message, usually in Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format. Both are formats for communicating data efficiently, and both resemble dicts.
I mention this because up until now you were probably aware of the .csv and .xlsx data formats, but these are not very good formats for transferring data because they carry a lot of useless information. .xml and .json are extremely common (as is .txt) because there's less superfluous stuff, so data transfer is more efficient. Note that there is a .read_json pandas method, so you can load json files into pandas directly. Accessing a url via read_json is also possible but not advisable.
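To see why JSON resembles a dict, here is a minimal sketch that parses a made-up JSON payload with Python's built-in json module (the series name and values are invented for illustration):

```python
import json

# A made-up JSON payload, as a string, like an API might return
payload = '{"series": "GDPCA", "observations": [{"date": "1970-01-01", "value": 4954.436}]}'

data = json.loads(payload)               # parse the string into Python objects
print(type(data).__name__)               # JSON objects map to dicts
print(data['observations'][0]['value'])  # nested lists/dicts index naturally
```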
On the whole, I find APIs convenient, useful, and powerful. Let's dig in.
Class Announcements¶
I'll post Exam 2 on eLC on Sunday, and it's due by end of day Thursday, November 3, 2022
1. Pandas Datareader: The Simple Approach (top)¶
We'll start with the (kind of) built-in web API package: pandas_datareader, which collects functions that interact with the APIs of several popular data sources. Today, we'll cover
- Yahoo! Finance
- St. Louis Fed's FRED
You'll likely have to install datareader:
pip install pandas-datareader
import pandas as pd # pandas, shortened to pd
import numpy as np
# If you receive an error while trying to load data_reader try uncommenting the line below
# This is/was a problem with older version of pandas_datareader
# pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data, wb # we are grabbing the data and wb functions from the package
import matplotlib.pyplot as plt # for plotting
import seaborn as sns
import datetime as dt # for time and date
API Keys¶
Many data providers do not want anonymous accounts connecting to the API and downloading data. These providers ask you to create an account, and you are given an API key that you pass along with your request. Often keys are free; sometimes they are not. If you use an API that requires a key, be careful not to make your key public.
In this notebook, we will go through a few examples that do not require API keys.
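If you do use a keyed API, the key typically travels as a query parameter (or header) on the request. Here is a minimal sketch with requests; the endpoint and the api_key parameter name are hypothetical, so check your provider's documentation for the real ones:

```python
import requests

# Hypothetical endpoint and parameter name -- check your provider's docs.
# In practice, read the key from an environment variable; never commit it to code.
api_key = 'demo-key'

req = requests.Request(
    'GET',
    'https://api.example.com/v1/data',
    params={'series': 'GDPCA', 'api_key': api_key},  # key rides along as a query parameter
).prepare()   # build the request without actually sending it
print(req.url)
```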
FRED¶
We've used the FRED database before. It's hosted by the St. Louis FRB and contains a lot of economic as well as financial data. It is US-centric but has some international data, too.
To use the FRED API you need to know the variable codes. The easiest way to find them is to search on the FRED website.
The pandas_datareader documentation for FRED is here.
codes = ['GDPCA', 'LFWA64TTUSA647N'] # these codes are for real US gdp and the working age population
# the first code seems intuitive. the second does not
# We have the codes. Now go get the data. The DataReader() function returns a DataFrame
# Create a datetime object for the start date. If you do not specify an end date, it returns data up to the most
# recent date
start = dt.datetime(1970, 1, 1)
fred = data.DataReader(codes, 'fred', start)
fred.head()
| DATE | GDPCA | LFWA64TTUSA647N |
|---|---|---|
| 1970-01-01 | 4954.436 | 1.180775e+08 |
| 1971-01-01 | 5117.603 | 1.208098e+08 |
| 1972-01-01 | 5386.733 | 1.241022e+08 |
| 1973-01-01 | 5690.853 | 1.267081e+08 |
| 1974-01-01 | 5660.091 | 1.291758e+08 |
fred.columns = ['gdp', 'wap'] # give the variables some reasonable names
# Let's plot real gdp per working age person
fred['gdp_wap'] = fred['gdp']*1000000000/fred['wap'] # gdp data is in billions
fred.head()
| DATE | gdp | wap | gdp_wap |
|---|---|---|---|
| 1970-01-01 | 4954.436 | 1.180775e+08 | 41959.187822 |
| 1971-01-01 | 5117.603 | 1.208098e+08 | 42360.844220 |
| 1972-01-01 | 5386.733 | 1.241022e+08 | 43405.632187 |
| 1973-01-01 | 5690.853 | 1.267081e+08 | 44913.101440 |
| 1974-01-01 | 5660.091 | 1.291758e+08 | 43816.978032 |
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(fred.index, fred['gdp_wap'], color='red')
ax.set_ylabel('2012 dollars')
ax.set_title('U.S. real GDP per working-age person')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Practice:¶
Take a few minutes and try the following. Feel free to chat with those around if you get stuck.
How has inflation in the United States evolved over the last 60 years? Let's investigate.
- Go the FRED website and find the code for the 'Consumer price index for all urban consumers: All items less food and energy'
- Use the api to get the data from 1960 to the most recent.
start = dt.datetime(1960, 1, 1)
fred = data.DataReader('CPILFESL', 'fred', start)
fred.head()
| DATE | CPILFESL |
|---|---|
| 1960-01-01 | 30.5 |
| 1960-02-01 | 30.6 |
| 1960-03-01 | 30.6 |
| 1960-04-01 | 30.6 |
| 1960-05-01 | 30.6 |
- Create a variable in your DataFrame that contains the growth rate of the CPI --- the inflation rate. Compute the growth rate in percentage terms.
fred['inflation'] = fred['CPILFESL'].pct_change()*100
fred.head()
| DATE | CPILFESL | inflation |
|---|---|---|
| 1960-01-01 | 30.5 | NaN |
| 1960-02-01 | 30.6 | 0.327869 |
| 1960-03-01 | 30.6 | 0.000000 |
| 1960-04-01 | 30.6 | 0.000000 |
| 1960-05-01 | 30.6 | 0.000000 |
- Plot it. What patterns do you see?
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(fred.index, fred['inflation'], color='blue')
ax.set_ylabel('percent')
ax.set_title('U.S. CPI inflation')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- We computed the month-to-month inflation rate above. This is not the inflation rate we usually care about. Can you compute and plot the year-over-year inflation rate? For example, the inflation rate for 1962-05-01 would be the cpi in 1962-05-01 divided by the cpi in 1961-05-01. [Hint: You could do this with .resample(), but you should also check the documentation for .pct_change().]
fred['infl_year'] = fred['CPILFESL'].pct_change(periods=12)*100
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(fred.index, fred['infl_year'], color='blue')
ax.set_ylabel('percent')
ax.set_title('U.S. CPI inflation')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Add the following annotations:
- Label the decrease in inflation around 1983 the 'Volcker disinflation'.
- Label the recent increase in inflation 'Covid inflation'.
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(fred.index, fred['infl_year'], color='blue')
ax.set_ylabel('percent')
ax.set_title('U.S. CPI inflation')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.text(dt.datetime(1983, 12,1), 12, 'Volcker disinflation', ha='left')
ax.axvspan(dt.datetime(1980, 6,1),dt.datetime(1983, 10,1),color='red',alpha=.2)
ax.text(dt.datetime(2019, 12,1), 12, 'Covid inflation',ha='right')
ax.axvspan(dt.datetime(2020, 1,1),dt.datetime(2022, 10,1),color='red',alpha=.2)
plt.show()
2. Accessing APIs via Requests (top)¶
While pandas_datareader is convenient, there are a lot of web servers equipped with APIs that are not included in the package. The more general approach to accessing web servers is to use the requests package -- the package we used last time for web scraping.
2.1 COVID19 Data¶
The COVID Tracking Project has a data API. Let's grab their data:
import requests # api module
url = 'https://api.covidtracking.com/v1/states/daily.json'
response = requests.get(url)
# Check out the HTML return
if response.status_code == 200:
    print('Download successful')
    covid = pd.read_json(response.text) # convert to dataframe
elif response.status_code == 301:
    print('The server redirected to a different endpoint.')
elif response.status_code == 400:
    print('Bad request.')
elif response.status_code == 401:
    print('Authentication required.')
elif response.status_code == 403:
    print('Access denied.')
elif response.status_code == 404:
    print('Resource not found.')
else:
    print('Server busy.')
# Clean-up data
covid['date'] = pd.to_datetime(covid['date'], format='%Y%m%d')
covid['month'] = covid['date'].dt.strftime('%B %Y')
covid.drop(covid[covid['date'] < '2020-03-01'].index, inplace=True) # Drop January and February, when there were few cases
Download successful
Let's take a look:
covid.head()
|   | date | state | positive | probableCases | negative | pending | totalTestResultsSource | totalTestResults | hospitalizedCurrently | hospitalizedCumulative | ... | deathIncrease | hospitalizedIncrease | hash | commercialScore | negativeRegularScore | negativeScore | positiveScore | score | grade | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-03-07 | AK | 56886.0 | NaN | NaN | NaN | totalTestsViral | 1731628.0 | 33.0 | 1293.0 | ... | 0 | 0 | dc4bccd4bb885349d7e94d6fed058e285d4be164 | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
| 1 | 2021-03-07 | AL | 499819.0 | 107742.0 | 1931711.0 | NaN | totalTestsPeopleViral | 2323788.0 | 494.0 | 45976.0 | ... | -1 | 0 | 997207b430824ea40b8eb8506c19a93e07bc972e | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
| 2 | 2021-03-07 | AR | 324818.0 | 69092.0 | 2480716.0 | NaN | totalTestsViral | 2736442.0 | 335.0 | 14926.0 | ... | 22 | 11 | 50921aeefba3e30d31623aa495b47fb2ecc72fae | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
| 3 | 2021-03-07 | AS | 0.0 | NaN | 2140.0 | NaN | totalTestsViral | 2140.0 | NaN | NaN | ... | 0 | 0 | f77912d0b80d579fbb6202fa1a90554fc4dc1443 | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
| 4 | 2021-03-07 | AZ | 826454.0 | 56519.0 | 3073010.0 | NaN | totalTestsViral | 7908105.0 | 963.0 | 57907.0 | ... | 5 | 44 | 0437a7a96f4471666f775e63e86923eb5cbd8cdf | 0 | 0 | 0 | 0 | 0 |  | March 2021 |
5 rows × 57 columns
Let's plot the time-series for several states to look for trends:
import matplotlib.pyplot as plt
subset = ['CA','GA','TX','ND','SD']
# Line graph
fig, ax = plt.subplots(figsize=(15,10))
# Figure 1: Raw Case Counts
for state in subset:
    dff = covid[covid['state']==state]
    ax.plot(
        dff.date,
        dff['positive'],
        label = state
    )
ax.set_ylabel('Total Number of Cases') # add the y-axis label
ax.set_title('COVID-19 Cases by Select States Over Time')
ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False) # get rid of the line on top
ax.legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax.axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\2769691991.py:25: UserWarning: FixedFormatter should only be used together with FixedLocator ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
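The FixedFormatter warning above comes from setting tick labels without pinning the tick locations. A tick formatter produces the same comma-separated labels without the warning; a minimal sketch with made-up data:

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot([0, 1, 2], [0, 500_000, 1_000_000])                   # made-up data
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))  # commas on the y-axis, no warning
plt.close(fig)   # closed here; in the notebook you would call plt.show() instead
```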
Practice¶
- Pick a different set of states and plot them.
import matplotlib.pyplot as plt
subset = ['WI','NY','FL']
# Line graph
fig, ax = plt.subplots(figsize=(15,10))
# Figure 1: Raw Case Counts
for state in subset:
    dff = covid[covid['state']==state]
    ax.plot(
        dff.date,
        dff['positive'],
        label = state
    )
ax.set_ylabel('Total Number of Cases') # add the y-axis label
ax.set_title('COVID-19 Cases by Select States Over Time')
ax.spines['right'].set_visible(False) # get rid of the line on the right
ax.spines['top'].set_visible(False) # get rid of the line on top
ax.legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax.axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\1063602651.py:25: UserWarning: FixedFormatter should only be used together with FixedLocator ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
- Create a new variable called 'new cases' using the .diff() method. Note that .diff() works just like .pct_change() and .shift(), which we've used before.
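As a quick reminder of how these three methods behave, here is a toy Series:

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 121.0])
print(s.diff().tolist())        # change from the prior row: [nan, 10.0, 11.0]
print(s.shift().tolist())       # the prior row's value: [nan, 100.0, 110.0]
print(s.pct_change().tolist())  # percent change from the prior row (roughly 10% each step here)
```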
covid = covid.sort_values(by=['state','date'])
covid.reset_index(inplace=True)
covid = covid[['date','state','positive']]
covid['new cases'] = covid.groupby('state')['positive'].diff().fillna(0).reset_index(0,drop=True)
covid['7-day rolling avg'] = covid.groupby('state')['new cases'].rolling(7).mean().fillna(0).reset_index(0,drop=True)
covid['14-day rolling avg'] = covid.groupby('state')['new cases'].rolling(14).mean().fillna(0).reset_index(0,drop=True)
covid[covid['state']=='GA'].head(10)
|   | date | state | positive | new cases | 7-day rolling avg | 14-day rolling avg |
|---|---|---|---|---|---|---|
| 4043 | 2020-03-04 | GA | 2.0 | 0.0 | 0.000000 | 0.0 |
| 4044 | 2020-03-05 | GA | 2.0 | 0.0 | 0.000000 | 0.0 |
| 4045 | 2020-03-06 | GA | 2.0 | 0.0 | 0.000000 | 0.0 |
| 4046 | 2020-03-07 | GA | 6.0 | 4.0 | 0.000000 | 0.0 |
| 4047 | 2020-03-08 | GA | 7.0 | 1.0 | 0.000000 | 0.0 |
| 4048 | 2020-03-09 | GA | 12.0 | 5.0 | 0.000000 | 0.0 |
| 4049 | 2020-03-10 | GA | 17.0 | 5.0 | 2.142857 | 0.0 |
| 4050 | 2020-03-11 | GA | 22.0 | 5.0 | 2.857143 | 0.0 |
| 4051 | 2020-03-12 | GA | 31.0 | 9.0 | 4.142857 | 0.0 |
| 4052 | 2020-03-13 | GA | 42.0 | 11.0 | 5.714286 | 0.0 |
- Create a 2-x-1 figure where the top figure is total cases and the bottom is new cases.
import matplotlib.pyplot as plt
subset = ['CA','GA','TX','ND','SD']
# Line graph
fig, ax = plt.subplots(2,1,figsize=(15,10))
# Figure 1: Raw Case Counts
for state in subset:
    dff = covid[covid['state']==state]
    ax[0].plot(
        dff.date,
        dff['positive'],
        label = state
    )
ax[0].set_ylabel('Total Number of Cases') # add the y-axis label
ax[0].set_title('COVID-19 Cases by Select States Over Time')
ax[0].spines['right'].set_visible(False) # get rid of the line on the right
ax[0].spines['top'].set_visible(False) # get rid of the line on top
ax[0].legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax[0].set_yticklabels(['{:,}'.format(int(x)) for x in ax[0].get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax[0].axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
# Figure 2: New cases (7-day rolling average)
for state in subset:
    dff = covid[covid['state']==state]
    ax[1].plot(
        dff.date,
        dff['7-day rolling avg'],
        label = state
    )
ax[1].set_ylabel('New COVID-19 Cases (7-day avg)') # add the y-axis label
ax[1].set_title('New COVID-19 Cases by Select States Over Time (7-day Avg)')
ax[1].spines['right'].set_visible(False) # get rid of the line on the right
ax[1].spines['top'].set_visible(False) # get rid of the line on top
ax[1].legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax[1].set_yticklabels(['{:,}'.format(int(x)) for x in ax[1].get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax[1].axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
plt.show()
C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\569824982.py:25: UserWarning: FixedFormatter should only be used together with FixedLocator ax[0].set_yticklabels(['{:,}'.format(int(x)) for x in ax[0].get_yticks().tolist()]) C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\569824982.py:48: UserWarning: FixedFormatter should only be used together with FixedLocator ax[1].set_yticklabels(['{:,}'.format(int(x)) for x in ax[1].get_yticks().tolist()])
- Download population data from the following api
https://api.census.gov/data/2019/pep/population?get=NAME,POP,DENSITY&for=state:*
url = 'https://api.census.gov/data/2019/pep/population?get=NAME,POP,DENSITY&for=state:*'
response = requests.get(url)
population = pd.read_json(response.text) # convert to dataframe
population.rename(columns=population.iloc[0],inplace=True)
population.drop(0,inplace=True)
population.drop(['state'], axis=1,inplace=True)
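The Census endpoint returns JSON as a list of rows in which the first row is the header, which is why the code above promotes row 0 to column names. The same idea on a made-up two-row payload (the figures here are illustrative, not official Census values):

```python
import pandas as pd

# Hypothetical rows in the Census shape: the header row comes first, then the data
rows = [['NAME', 'POP', 'DENSITY', 'state'],
        ['Georgia', '10617423', '184.6', '13']]

demo = pd.DataFrame(rows[1:], columns=rows[0])  # first row becomes the header
print(demo.columns.tolist())                    # named columns, data rows only
```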
- Add the two digit postal codes using 'state_abbrev.csv' (in the class data folder) to the population data.
abbrev = pd.read_csv('./Data/state_abbrev.csv',header=None) # convert to dataframe
abbrev.rename(columns={0:'NAME',1:'state'},inplace=True)
population = population.merge(abbrev,on='NAME',how='outer')
population.drop(['NAME'], axis=1,inplace=True)
population['POP'] = population.POP.astype(float)
population['DENSITY'] = population.DENSITY.astype(float)
- Add the population data to the covid df and create a variable of new cases per 10k residents. Repeat (3) using this per capita measure.
# Add population information
covid = covid.merge(population,on='state',how='inner')
covid['positive pc'] = 1e+4*covid['positive']/covid['POP']
covid['new cases pc'] = 1e+4*covid['new cases']/covid['POP']
covid['7-day rolling avg pc'] = 1e+4*covid['7-day rolling avg']/covid['POP']
covid.head()
|   | date | state | positive | new cases | 7-day rolling avg | 14-day rolling avg | POP | DENSITY | positive pc | new cases pc | 7-day rolling avg pc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-03-06 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
| 1 | 2020-03-07 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
| 2 | 2020-03-08 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
| 3 | 2020-03-09 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
| 4 | 2020-03-10 | AK | NaN | 0.0 | 0.0 | 0.0 | 731545.0 | 1.281127 | NaN | 0.0 | 0.0 |
import matplotlib.pyplot as plt
subset = ['CA','GA','TX','ND','SD']
fig, ax = plt.subplots(2,1,figsize=(15,10))
for state in subset:
    dff = covid[covid['state']==state]
    ax[0].plot(
        dff.date,
        dff['positive pc'],
        label = state
    )
ax[0].set_ylabel('Total Number of Cases per 10K Residents') # add the y-axis label
ax[0].set_title('COVID-19 Cases per 10K Residents by Select States Over Time')
ax[0].spines['right'].set_visible(False) # get rid of the line on the right
ax[0].spines['top'].set_visible(False) # get rid of the line on top
ax[0].legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax[0].set_yticklabels(['{:,}'.format(int(x)) for x in ax[0].get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax[0].axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
for state in subset:
    dff = covid[covid['state']==state]
    ax[1].plot(
        dff.date,
        dff['7-day rolling avg pc'],
        label = state
    )
ax[1].set_ylabel('New COVID-19 Cases per 10K Residents (7-day avg)') # add the y-axis label
ax[1].set_title('New COVID-19 Cases per 10K Residents by Select States Over Time (7-day Avg)')
ax[1].spines['right'].set_visible(False) # get rid of the line on the right
ax[1].spines['top'].set_visible(False) # get rid of the line on top
ax[1].legend(frameon=False) # Show the legend. frameon=False kills the box around the legend
# Format y-axis to include commas. Note: the following code loops through the ticks and converts each to thousands format (i.e., include commas)
ax[1].set_yticklabels(['{:,}'.format(int(x)) for x in ax[1].get_yticks().tolist()])
# Add a dashed horizontal line at y=0
ax[1].axhline(y=0, color='black', linewidth=0.75, dashes=[5,5]);
plt.show()
C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\1046391063.py:25: UserWarning: FixedFormatter should only be used together with FixedLocator ax[0].set_yticklabels(['{:,}'.format(int(x)) for x in ax[0].get_yticks().tolist()]) C:\Users\jt83241\AppData\Local\Temp\ipykernel_48136\1046391063.py:48: UserWarning: FixedFormatter should only be used together with FixedLocator ax[1].set_yticklabels(['{:,}'.format(int(x)) for x in ax[1].get_yticks().tolist()])
2.2 S&P 500 Stock Prices¶
In the code below we access the Wikipedia page that lists the S&P 500 company tickers (link). We use requests to grab the html code sent to our computer, use the python package "Beautiful Soup" (bs4) to parse the html, and grab the tickers. I wrote this as a function, but frankly that's just to remind you how user-defined functions work.
import bs4 as bs
# Scrape sp500 tickers
def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'html.parser')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        if '.' not in ticker:
            tickers.append(ticker.replace('\n', ''))
    return tickers
tickers = save_sp500_tickers()
Now use the Yahoo! Finance API to download the stock history for each ticker in the S&P 500. NB, this API isn't actually maintained by Yahoo! anymore, and requesting data prior to 1970 won't work. You'll have to install the package:
pip install yfinance
import yfinance as yf
prices = yf.download(tickers, start='2018-01-01')['Adj Close']
prices.to_csv('stock_data.csv') # keep the date index so the saved data stays usable
[*********************100%***********************] 501 of 501 completed
Generate Log-Returns¶
Define "log-returns" as the difference in log-value of an asset over the time interval (t-1, t): $$ r_{it}= \log\bigg(\frac{p_t}{p_{t-1}}\bigg)=\log(p_t)-\log(p_{t-1}) $$ Recall that the log difference approximates the percentage change. Log-returns are useful in quantitative finance for a number of reasons (stationarity, log-normality, etc.) and they're time-additive, so computing cumulative returns (i.e., past performance) is easy. **Note: We'll see this math again when we do Principal Component Analysis (PCA) later in the semester to estimate a Capital Asset Pricing Model (CAPM).**
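Time-additivity is easy to check numerically: summing the daily log-returns recovers the log of the cumulative gross return. A small sketch with a made-up price path:

```python
import numpy as np

p = np.array([100.0, 105.0, 103.0, 110.0])   # a made-up price path
r = np.diff(np.log(p))                        # daily log-returns
# The sum telescopes: log(p1/p0) + log(p2/p1) + ... = log(p_last/p_first)
print(np.isclose(r.sum(), np.log(p[-1] / p[0])))
```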
Time to generate log-returns for the S&P500 stocks.
rs = prices.apply(np.log).diff(1)
Visualization¶
Let's generate a plot of log returns over time for the S&P 500 stocks. There will be a lot of lines, and the end product will look like a Jackson Pollock painting.
fig, ax = plt.subplots(figsize=(15,10))
rs.plot(ax=ax,legend=False) # We can use pandas for plotting too!
ax.axhline(y=0,linestyle='--',color='gray')
ax.set_ylabel('')
ax.set_xlabel('')
ax.set_title('Daily Returns of Companies in the S&P 500',fontsize=24,pad=20)
ax.set_facecolor('white')
plt.xticks(rotation = 0, size=18)
ax.xaxis.set_ticks_position('none') # Remove the small tick marks
plt.yticks(size=18)
sns.despine(ax=ax)
plt.show()