Lecture 9: Visualizing Data with Matplotlib [SUGGESTED SOLUTIONS]

We have a handle on python now: we understand the data structures and enough about working with them to move on to stuff more directly relevant to data analysis. We know how to get data into Pandas from files, how to manipulate DataFrames and how to do basic statistics.

Let's get started on making figures, arguably the best way to convey information about our data.

Today, we will cover:

Matplotlib

Histograms

Subplots

Bar Charts

Scatter Plots

Class Announcements

PS2 (due 9/18) is on eLC.

1. Matplotlib (top)¶

Matplotlib is a very popular package that bundles tools for creating visualizations. The documentation is here. We will look at some specific plot types in class, but you can learn about many different types thumbnail gallery. [Warning: not all the figures in the thumbnail gallery are good figures.]

We start in the usual way by loading packages.

In [1]:

import pandas as pd     #load the pandas package and call it pd
import matplotlib.pyplot as plt   # load the pyplot set of tools from the package matplotlib. Name it plt for short.

And now let's go back to our principles of macroeconomics days and look at some national income account data.

In [2]:

import pandas as pd     #load the pandas package and call it pd
import matplotlib.pyplot as plt   # load the pyplot set of tools from the package matplotlib. Name it plt for short.

gdp = pd.read_csv('./Data/gdp_components_simple.csv', index_col=0)  # load data from file, make date the index

print(gdp.head(2))                                    # print the first and last few rows to make sure all is well
print('\n', gdp.tail(2))

         GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE                                         
1929  104.556  17.170   9.622   5.939   5.556
1930   92.160  11.428  10.273   4.444   4.121

            GDPA     GPDIA      GCEA    EXPGSA    IMPGSA
DATE                                                   
2016  18707.189  3169.887  3290.979  2217.576  2738.146
2017  19485.394  3367.965  3374.444  2350.175  2928.596

I don't like these variable names.

In [3]:

gdp.rename(columns = {'GDPA':'gdp', 'GPDIA':'inv', 'GCEA':'gov', 'EXPGSA':'ex', 'IMPGSA':'im' }, inplace=True)

Let's get plotting. matplotlib graphics are based around two new object types.

The figure object: think of this as the canvas we will draw figures onto
The axes object: think of this as the figure itself and all the components

To create a new figure, we call the subplots() method of plt. Notice the use of multiple assignment.

In [4]:

fig, ax = plt.subplots()    # passing no arguments gets us one fig object and one axes object
plt.show()  # tells jupyter to show the figure

In [5]:

print(type(fig))

print(type(ax))

<class 'matplotlib.figure.Figure'>
<class 'matplotlib.axes._subplots.AxesSubplot'>

We apply methods to the axes to actually plot the data. Here is a scatter plot. [Try ax. and hit TAB...]

In [6]:

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'])                  # scatter plot of gdp vs. time
plt.show()  # tells jupyter to show the figure

First, note that the plot is a Line2D object. This is absolutely not important for us, but when you see jupyter print out <matplotlib.lines.Line2D at ...> that is what it is telling us. Everything in python is an object.

Second, a scatter plot needs two columns of data, one for the x-coordinate and one for the y-coordinate. I am using gdp for the y-coordinate and the years for the x-coordinate. I set years as the index variable, so to retrieve it I used the .index attribute.

Third, this plot needs some work. I do not like this line color. More importantly, I am missing labels and a title. These are extremely important.

In [7]:

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red'                   # set the line color to red
       )                  

ax.set_ylabel('billions of dollars')  # add the y-axis label
ax.set_xlabel('year')                 # add the x-axis label
ax.set_title('U.S. Gross Domestic Product')
plt.show()  # tells jupyter to show the figure

This is looking pretty good. While I am a fanatic when it comes to labeling things, I probably wouldn't label the x-axis. You have to have some faith in the reader.

I also do not like 'boxing' my plots. There is a philosophy about visualizations that says: Every mark on your figure should convey information. If it does not, then it is clutter and should be removed. I am not sure who developed this philosophy (Marie Kondo?) but I think it is a useful benchmark.

In [8]:

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red'                   # set the line color to red
       )  

ax.set_ylabel('billions of dollars')  # add the y-axis label
# ax.set_xlabel('year')                 # add the x-axis label
ax.set_title('U.S. Gross Domestic Product')

ax.spines['right'].set_visible(False) # get ride of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

plt.show()  # tells jupyter to show the figure

Practice: Line Plots ¶

Take a few minutes and try the following. Feel free to chat with those around if you get stuck.

Copy the code from the last plot and add a second line that plots 'gov'. To this, just add a new line of code to the existing code. ax.plot(gdp.index, gdp['gov'])

In [9]:

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red'                   # set the line color to red
       )  

ax.plot(gdp.index, gdp['gov'],        # line plot of gdp vs. time
        color='blue',                   # set the line color to blue
        alpha = 0.5,
        linestyle = ':'
       )  
ax.set_ylabel('billions of dollars')  # add the y-axis label
# ax.set_xlabel('year')                 # add the x-axis label
ax.set_title('U.S. Gross Domestic Product and Government Spending')

ax.spines['right'].set_visible(False) # get ride of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

plt.show()  # tells jupyter to show the figure

Modify your code to give the figure a better title
Modify your code to make government consumption blue
Modify your code to add the argument alpha=0.5 to the plot method for gov. What does it change? If you want to learn more try 'alpha composite' in Google.
Modify your code to make the gov line dashed. Try the argument linestyle='--'. What is linestyle '-.' or ':' ?

A few more options to get us started¶

We have two lines on our figure. Which one is which? Not labeling our line is malpractice. Two approaches

Add a legend
Add text to the figure

Both are good options. I prefer the second for simple plots.

In [10]:

# The first option. Add labels to your plot commands, then call ax.legend.

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red',                   # set the line color to red
       label = 'GDP'
       )  

ax.plot(gdp.index, gdp['gov'],        # line plot of gdp vs. time
        color='blue',                   # set the line color to blue
        alpha = 0.5,
        linestyle = ':',
        label = 'Gov. Spending'
       )  
ax.set_ylabel('billions of dollars')  # add the y-axis label
# ax.set_xlabel('year')                 # add the x-axis label
ax.set_title('U.S. Gross Domestic Product and Government Spending')

ax.spines['right'].set_visible(False) # get ride of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

ax.legend(frameon=False)                           # Show the legend. frameon=False kills the box around the legend

plt.show()  # tells jupyter to show the figure

Ah, I feel much better now that I know which line is which. Here is the second approach.

In [11]:

# The second option. Add text using the annotate method. Note that I can leave the labels in the plot commands.

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red',                   # set the line color to red
       label = 'GDP'
       )  

ax.plot(gdp.index, gdp['gov'],        # line plot of gdp vs. time
        color='blue',                   # set the line color to blue
        alpha = 0.5,
        linestyle = ':',
        label = 'Gov. Spending'
       )  
ax.set_ylabel('billions of dollars')  # add the y-axis label
# ax.set_xlabel('year')                 # add the x-axis label
ax.set_title('U.S. Gross Domestic Product and Government Spending')

ax.spines['right'].set_visible(False) # get ride of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

ax.text(1989, 8500, 'GDP')            # text(x, y, string)
ax.text(1999, 4500, 'Gov. Spending')            # text(x, y, string)

plt.show()  # tells jupyter to show the figure

Getting plots out of your notebook¶

While I love jupyter notebooks, my research output is usually an article distributed as a pdf.

In [12]:

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp'],        # line plot of gdp vs. time
        color='red',                   # set the line color to red
       label = 'GDP'
       )  

ax.plot(gdp.index, gdp['gov'],        # line plot of gdp vs. time
        color='blue',                   # set the line color to blue
        alpha = 0.5,
        linestyle = ':',
        label = 'Gov. Spending'
       )  
ax.set_ylabel('billions of dollars')  # add the y-axis label
# ax.set_xlabel('year')                 # add the x-axis label
ax.set_title('U.S. Gross Domestic Product and Government Spending')

ax.spines['right'].set_visible(False) # get ride of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

ax.text(1989, 8500, 'GDP')            # text(x, y, string)
ax.text(1999, 4500, 'Gov. Spending')            # text(x, y, string)

plt.savefig('gdp.pdf', bbox_inches='tight')          # Create a pdf and save to cwd 
plt.savefig('../gdp.png')          # Create a png and save to the folder that contains the cwd

plt.show()  # tells jupyter to show the figure

When saving a pdf, I use the bbox_inches='tight' argument to kill extra whitespace around the figure. You can also set things like orientation, dpi, and metadata. Check the documentation if you need to tweak your output.

2. Histograms (top)¶

The line plot is the tip of the iceberg. matplotlib support many plot types. Let's take a look at histograms.

How variable is US gdp growth?

In [13]:

# Create a histogram of gdp growth rates.

gdp['gdp_growth'] = gdp['gdp'].pct_change()*100 # pct_change() creates growth rates NOT percent change. Not a self-documenting name.
gdp.head()

Out[13]:

	gdp	inv	gov	ex	im	gdp_growth
DATE
1929	104.556	17.170	9.622	5.939	5.556	NaN
1930	92.160	11.428	10.273	4.444	4.121	-11.855848
1931	77.391	6.549	10.169	2.906	2.905	-16.025391
1932	59.522	1.819	8.946	1.975	1.932	-23.089248
1933	57.154	2.276	8.875	1.987	1.929	-3.978361

We could have used the diff() or the shift() methods to do something similar, but wow, pct_change is so luxe. A quick plot to take a look.

In [14]:

fig, ax = plt.subplots() 
ax.plot(gdp.index, gdp['gdp_growth'],        # line plot of gdp vs. time
        color='red',                   # set the line color to red
       label = 'GDP Growth'
       )  

ax.set_ylabel('percent growth')  # add the y-axis label
ax.set_title('U.S. Gross Domestic Product Growth Rates')

ax.spines['right'].set_visible(False) # get ride of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

ax.axhline(y=0, color='black', linewidth=0.75)  # Add a horizontal line at y=0

plt.show()  # tells jupyter to show the figure

The great depression and the WWII buildup really stick out.

Notice that I added a line at zero. My thinking is that this line adds information: the reader can easily see that growth rates are mostly positive and that the great depression was really bad.

It is also obvous that the volitility of gdp has fallen over time, but let's approach a bit differently.

In [15]:

fig, ax = plt.subplots() 

# hist does not like NaN. (I'm a bit surprised.) I use the dropna() method to kill off the missing value
ax.hist(gdp['gdp_growth'].dropna(), bins=20, color='red', alpha=0.75)        # histogram of GDP growth rates
      

ax.set_ylabel('Frequency')  # add the y-axis label
ax.set_xlabel('Annual growth rate (%)')
ax.set_title('Frequency of US GDP growth rates (1929-2017)')

ax.spines['right'].set_visible(False) # get ride of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

#ax.axhline(y=0, color='black', linewidth=0.75)  # Add a horizontal line at y=0

plt.show()  # tells jupyter to show the figure

Practice: Histograms ¶

Take a few minutes and try the following. Feel free to chat with those around if you get stuck.

Break the data up into two periods: 1929-1985 and 1985-2017
Compute the mean and the standard deviation for the gdp growth rate in each sample.
Create a separate histogram for each sample. Make the early period historgram blue and the late historgram black. Make any changes to them that you deem appropriate.
Use text() to add the mean and std to a blank area of the histograms.
Save the two histograms as pdfs. Give them reasonable names.

Challenging. Can you find a way to store the value of the mean and std to a variable and print the variable out on the histogram? Redo part 4.

In [16]:

gdp_early = gdp[gdp.index <= 1986]
gdp_late = gdp[gdp.index > 1985]

avg_early = gdp_early['gdp_growth'].mean()
sd_early = gdp_early['gdp_growth'].std()

avg_late = gdp_late['gdp_growth'].mean()
sd_late = gdp_late['gdp_growth'].std()

print(avg_early, sd_early)
print(avg_late, sd_late)

7.195155329738043 8.32030467983293
4.822433250409485 1.886061415326792

In [17]:

fig, ax = plt.subplots() 

# hist does not like NaN. (I'm a bit surprised.) I use the dropna() method to kill off the missing value
ax.hist(gdp_early['gdp_growth'].dropna(), bins=20, color='red', alpha=0.75)        # histogram of GDP growth rates
      
ax.set_ylabel('Frequency')  # add the y-axis label
ax.set_xlabel('Annual growth rate (%)')
ax.set_title('Frequency of US GDP growth rates (1929-1985)')

ax.spines['right'].set_visible(False) # get ride of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

ax.text(-20,14,'Avg GDP: '+str(round(avg_early,2)))
ax.text(-20,12,'Syd GDP: '+str(round(sd_early,2)))

plt.show()  # tells jupyter to show the figure

In [18]:

fig, ax = plt.subplots() 

# hist does not like NaN. (I'm a bit surprised.) I use the dropna() method to kill off the missing value
ax.hist(gdp_late['gdp_growth'].dropna(), bins=20, color='red', alpha=0.75)        # histogram of GDP growth rates
      
ax.set_ylabel('Frequency')  # add the y-axis label
ax.set_xlabel('Annual growth rate (%)')
ax.set_title('Frequency of US GDP growth rates (1986-2017)')

ax.spines['right'].set_visible(False) # get ride of the line on the right
ax.spines['top'].set_visible(False)   # get rid of the line on top

ax.text(-1.5,4,'Avg GDP: '+str(round(avg_early,2)))
ax.text(-1.5,3.5,'Syd GDP: '+str(round(sd_early,2)))

plt.show()  # tells jupyter to show the figure

3. Subplots (top)¶

We can generate several axes in one figure using the subplot() method. [This method is not misnamed!]

In [19]:

fig, ax = plt.subplots(1, 2)  # one row, two columns of axes

In [20]:

print(type(ax))

<class 'numpy.ndarray'>

So ax is now an array that holds the axes for each plot. Each axes works just like before. Now we just have to tell python which axes to act on.

In [21]:

# Set a variable for plot color so I can change it everywhere easily
my_plot_color = 'black'

# I am using the figsize parameter here. It takes (width, height) in inches. 
fig, ax = plt.subplots(1, 2, figsize=(10,4))  # one row, two columns of axes

# The fist plot
ax[0].plot(gdp.index, gdp['gdp_growth'], color=my_plot_color, label = 'GDP Growth')     # a line plot of GDP growth rates
ax[0].axhline(y=0, color='black', linewidth=0.75)  # Add a horizontal line at y=0
ax[0].set_xlabel('year')
ax[0].set_title('GDP growth rates')
ax[0].spines['right'].set_visible(False) # get ride of the line on the right
ax[0].spines['top'].set_visible(False)   # get rid of the line on top

# The second plot
ax[1].hist(gdp['gdp_growth'].dropna(), bins=20, color=my_plot_color, alpha=0.25)        # histogram of GDP growth rates
ax[1].set_xlabel('annual growth rate')
ax[1].set_title('Histogram of GDP growth rates')
ax[1].spines['right'].set_visible(False) # get ride of the line on the right
ax[1].spines['top'].set_visible(False)   # get rid of the line on top

plt.savefig('double.pdf')

plt.show()  # tells jupyter to show the figure

You can imagine how useful this can be. We can loop over sets of axes and automate making plots if we have several variables.

I changed a couple other things here, too.

I used the figsize parameter to subplot. This is a tuple of figure width and height in inches. (Inches! Take that rest of the world!) The height and width are of the printed figure. You will notice that jupyter notebook scaled it down for display. This is useful when you are preparing graphics for a publication and you need to meet an exact figure size.

I made the line color a variable, so it is easy to change all the line colors at one. For example, I like red figures when I am giving presentations, but black figures when I am creating pdfs that will be printed out on a black and white printer.

4. Bar charts (top)¶

Bar charts are useful for describing relatively few observations of categorical data --- meaning that one of the axes is not quantitative. Tufte would complain that they have a lot of redundant ink, but they are quite popular...and Tufte is not our dictator. Although, it's always good to think about what our figures are doing for us.

Bar charts are much better than pie charts for displaying the relative size of data. There are discussions of this all over the net (here is one I like) but the anti-pie-chart argument boils down to: pie charts are hard to read.

Humans are bad at judging the relative sizes of 2D spaces. They cannot tell if one slice is 10% larger than another slice.
The MS Excel style of coloring the slice different colors creates problems. Humans judge darker colors to have larger areas.
To get quantitative traction, people label the slices with the data values. In this case, a table of numbers is probably a better way to share the data.

In [22]:

# PPP GDP data from the penn world tables 

code    = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
country = ['United States', 'France', 'Japan', 'China', 'India',
             'Brazil', 'Mexico']
gdppc   = [53.1, 36.9, 36.3, 11.9, 5.4, 15.0, 16.5]

gdp = pd.DataFrame({'gdppc': gdppc, 'country': country}, index=code)
gdp

Out[22]:

	gdppc	country
USA	53.1	United States
FRA	36.9	France
JPN	36.3	Japan
CHN	11.9	China
IND	5.4	India
BRA	15.0	Brazil
MEX	16.5	Mexico

In [23]:

fig, ax = plt.subplots(figsize=(10,5))

ax.bar(gdp.index, gdp['gdppc'], color='blue', alpha=0.25)      # bar(x labels, )

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.set_ylabel('PPP GDP per capita')
ax.set_title('Income per person (at purchasing power parity)')

plt.show()  # tells jupyter to show the figure

The ordering of the bars is pretty random. We could sort it poor to rich.

In [24]:

fig, ax = plt.subplots(figsize=(10,5))

gdp_sort= gdp.sort_values('gdppc')

ax.bar(gdp_sort.index, gdp_sort['gdppc'], color='blue', alpha=0.25)      # bar(x labels, )
ax.grid(axis='y', color='white')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.set_title('Income per person (at purchasing power parity)')
ax.set_ylabel('PPP GDP per capita')

plt.show()  # tells jupyter to show the figure

Notice the use of grid() to specify grid lines on the y axis. I made them white, so they only show up in the bars. It's something I'm experimenting with. I'm not sure I like it.

Maybe you prefer a horizontal bar chart. Same data, same approach. We need to swap all the y labels for x labels.

Practice: Bar Charts ¶

Take a few minutes and try the following. Feel free to chat with those around if you get stuck.

Create a horizontal bar chart. Check the documentation for barh()
Fix up your figure labels, etc.

In [25]:

fig, ax = plt.subplots(figsize=(10,5))

gdp_sort= gdp.sort_values('gdppc')

ax.barh(gdp_sort.index, gdp_sort['gdppc'], color='red', alpha=0.25)      # bar(x labels, )
ax.grid(axis='x', color='white')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.set_xlabel('PPP GDP per capita')
ax.set_title('Income per person (at purchasing power parity)')

plt.show()  # tells jupyter to show the figure

Create a new horizontal bar chart where each bar is gdp per capita relative to the United States. So USA =1, MEX = 0.31, etc.

In [26]:

gdp_sort['rel_gdp'] = gdp_sort['gdppc']/gdp_sort.loc['USA', 'gdppc']

fig, ax = plt.subplots(figsize=(10,5))

ax.barh(gdp_sort.index, gdp_sort['rel_gdp'], color='red', alpha=0.25)      # bar(x labels, )
ax.grid(axis='x', color='white')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.set_xlabel('PPP GDP per capita relative to the United States')
ax.set_title('Income per person (at purchasing power parity)')

plt.show()  # tells jupyter to show the figure

5. Scatter plots (top)¶

Scatter plots are used to compare two variables. A very common approach to visualize the correlation of two variables. Let's load some data from the internet using the FRED API. APIs are useful methods to get great data. We'll learn more about APIs in a future lecture.

In [27]:

from pandas_datareader import data, wb    # we are grabbing the data and wb functions from the package
import datetime as dt                     # for time and date. We'll learn more about datetime in the next lecture when we do time-series

codes = ['GDPC1', 'UNRATE']               # real gdp, unemployment rate
start = dt.datetime(1970, 1, 1)
fred = data.DataReader(codes, 'fred', start)

fred.head()

Out[27]:

	GDPC1	UNRATE
DATE
1970-01-01	4939.759	3.9
1970-02-01	NaN	4.2
1970-03-01	NaN	4.4
1970-04-01	4946.770	4.6
1970-05-01	NaN	4.8

Ugh. The gdp data are quarterly, but the unemployment rate is monthly. Let's fix this by downsampling to quarterly frequency.

In [28]:

fred_q=fred.resample('q').mean()                # Create an average quarterly unemployment rate
fred_q.head()

Out[28]:

	GDPC1	UNRATE
DATE
1970-03-31	4939.759	4.166667
1970-06-30	4946.770	4.766667
1970-09-30	4992.357	5.166667
1970-12-31	4938.857	5.833333
1971-03-31	5072.996	5.933333

In [29]:

fred_q['gdp_gr'] = fred_q['GDPC1'].pct_change()*100        # growth rate of gdp. we've seen this a few times...
fred_q['unemp_dif'] = fred_q['UNRATE'].diff()              # difference takes the first difference: u(t)-u(t-1)   
fred_q.head()

Out[29]:

	GDPC1	UNRATE	gdp_gr	unemp_dif
DATE
1970-03-31	4939.759	4.166667	NaN	NaN
1970-06-30	4946.770	4.766667	0.141930	0.600000
1970-09-30	4992.357	5.166667	0.921551	0.400000
1970-12-31	4938.857	5.833333	-1.071638	0.666667
1971-03-31	5072.996	5.933333	2.715993	0.100000

In [30]:

fig, ax = plt.subplots(figsize=(10,5))
                       
ax.scatter(fred_q.gdp_gr, fred_q.unemp_dif)

ax.set_title('Okun\'s Law in the United States' )
ax.set_ylabel('change in unemployment rate')
ax.set_xlabel('gdp growth rate')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()  # tells jupyter to show the figure

Practice: Scatter Plots ¶

Take a few minutes and try the following. Feel free to chat with those around if you get stuck.

Let's explore some of scatter plot's options.

Change the color of the dots to red and lighten them up using alpha

In [31]:

fig, ax = plt.subplots(figsize=(10,5))
                       
ax.scatter(fred_q.gdp_gr, fred_q.unemp_dif, color='red', alpha = 0.25, marker = '^')

ax.text(fred_q.loc['1971-3-31', 'gdp_gr']+0.1, fred_q.loc['1971-3-31', 'unemp_dif'], '2009Q3', ha='left')

ax.set_title('Okun\'s Law in the United States' )
ax.set_ylabel('change in unemployment rate')
ax.set_xlabel('gdp growth rate')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()  # tells jupyter to show the figure

Check out the documentation for marker styles.

Change the marker to a triangle.
Use text() or annotate() to label the point corresponding to third quarter 2009: '2009Q3'

In [32]:

fig, ax = plt.subplots(figsize=(10,5))
                       
ax.scatter(fred_q.gdp_gr, fred_q.unemp_dif, color='red', alpha = 0.25, marker = '^')

ax.text(fred_q.loc['1971-3-31', 'gdp_gr']+0.1, fred_q.loc['1971-3-31', 'unemp_dif'], '2009Q3', ha='left')

ax.set_title('Okun\'s Law in the United States' )
ax.set_ylabel('change in unemployment rate')
ax.set_xlabel('gdp growth rate')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()  # tells jupyter to show the figure

Scatter plots are very useful and we can do a lot more with them. Places to go from here.

Add a line of best fit. A bit clunky in matplotlib (use np's polyfit command), but not too bad. Seaborn is a package that automates some matplotlib commands while also introducing some new, useful plots. For example, it has a regplot command that makes adding a trend line simple.

import seaborn as sns

sns.regplot(x="gdp_gr", y="unemp_dif", data=fred, ax=ax)  # the ax=ax tells to apply reglplot to the plot "ax"

Make data markers different colors or sizes depending on the value of a third variable. For example, you could get some more data and color the markers for years with a repbulican president red and markers for years with democratic presidents blue.

Zoom-in on the bulk of the data by either dropping outliers or changing the axis limits. You can zoom-in on either (or both) axes using

ax.set_xlim(xmin,xmax)  # xmin is the lower bound, xmax is the upper bound
ax.set_ylim(ymin,ymax)  # ymin is the lower bound, ymax is the upper bound

Other ideas?

In [ ]:

Lecture 9: Visualizing Data with Matplotlib [SUGGESTED SOLUTIONS]

Class Announcements

1. Matplotlib (top)¶

Practice: Line Plots ¶

A few more options to get us started¶

Getting plots out of your notebook¶

2. Histograms (top)¶

Practice: Histograms ¶

3. Subplots (top)¶

4. Bar charts (top)¶

Practice: Bar Charts ¶

5. Scatter plots (top)¶

Practice: Scatter Plots ¶

Jeff Thurk // jeff.thurk@uga.edu // Department of Economics // University of Georgia