Lecture 13: Merging DataFrames [SUGGESTED SOLUTIONS]
We will often find ourselves with variables spread across different datasets and files. We merge datasets together by matching up the two datasets on one or more variables. For example, I might have GDP data by country from the Penn World Tables, and demographic data by country from the World Bank. We would merge these two datasets and match up the observations by country.
Here's an outline of today's lecture:
1. Apartment data from Zillow
2. Merging DataFrames into one DataFrame
1. Apartment Data from Zillow
We are working with data from Zillow on rental listing prices. I would like to see how studio and one-bedroom prices differ from each other across cities. Unfortunately, Zillow only lets me download one series at a time. (Hey, the data are free, so I'm not complaining...maybe.) I have downloaded the studio and one-bedroom rental listing prices, and now I would like to create one DataFrame with all the data.
Time to get to work.
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
# The data come as csv files. Load them into separate DataFrames.
studios = pd.read_csv('./Data/Metro_MedianRentalPrice_Studio.csv')
onebeds = pd.read_csv('./Data/Metro_MedianRentalPrice_1Bedroom.csv')
# Always check out new data! You never know what could be lurking.
print(studios.head())
print(studios.tail())
        RegionName  SizeRank  2010-01  2010-02  2010-03  2010-04  2010-05  \
0    United States         0   1300.0   1500.0   1500.0   1500.0   1500.0
1     New York, NY         1      NaN   2000.0   2000.0   2100.0   2000.0
2  Los Angeles, CA         2      NaN      NaN      NaN      NaN      NaN
3      Chicago, IL         3      NaN   1500.0   1500.0   1500.0   1500.0
4       Dallas, TX         4      NaN      NaN      NaN      NaN      NaN

   2010-06  2010-07  2010-08  ...  2017-11  2017-12  2018-01  2018-02  \
0   1500.0   1500.0   1500.0  ...   1495.0   1500.0   1495.0   1500.0
1   2125.0   2000.0   1950.0  ...   2300.0   2300.0   2300.0   2300.0
2      NaN      NaN      NaN  ...   1800.0   1815.0   1795.0   1807.0
3   1500.0   1500.0   1500.0  ...   1532.5   1500.0   1500.0   1500.0
4      NaN      NaN      NaN  ...   1300.0   1300.0   1342.5   1330.0

   2018-03  2018-04  2018-05  2018-06  2018-07  2018-08
0   1500.0   1520.0   1550.0   1550.0   1550.0   1550.0
1   2300.0   2350.0   2400.0   2400.0   2400.0   2400.0
2   1805.0   1829.0   1850.0   1810.0   1835.0   1877.0
3   1500.0   1492.0   1500.0   1500.0   1510.0   1525.0
4   1325.0   1300.0   1325.0   1350.0   1355.0   1350.0

[5 rows x 106 columns]

          RegionName  SizeRank  2010-01  2010-02  2010-03  2010-04  2010-05  \
162     Cheyenne, WY       162      NaN      NaN      NaN      NaN      NaN
163      Branson, MO       163      NaN      NaN      NaN      NaN      NaN
164   Hinesville, GA       164      NaN      NaN      NaN      NaN      NaN
165  Waynesville, MO       165      NaN      NaN      NaN      NaN      NaN
166   Enterprise, AL       166      NaN      NaN      NaN      NaN      NaN

     2010-06  2010-07  2010-08  ...  2017-11  2017-12  2018-01  2018-02  \
162      NaN      NaN      NaN  ...   1200.0   1250.0   1200.0   1225.0
163      NaN      NaN      NaN  ...    800.0    800.0    825.0    800.0
164      NaN      NaN      NaN  ...   1150.0   1150.0   1150.0   1200.0
165      NaN      NaN      NaN  ...   1050.0   1037.5    950.0   1025.0
166      NaN      NaN      NaN  ...    950.0    950.0    950.0    950.0

     2018-03  2018-04  2018-05  2018-06  2018-07  2018-08
162   1200.0   1250.0   1285.0   1295.0   1295.0   1217.5
163    795.0    850.0    885.0    850.0    862.5    850.0
164   1200.0   1200.0   1250.0   1225.0   1250.0   1150.0
165   1025.0   1050.0   1075.0   1095.0   1025.0   1025.0
166    925.0    900.0    900.0    925.0    925.0    950.0

[5 rows x 106 columns]
print(onebeds.head())
print(onebeds.tail())
        RegionName  SizeRank  2010-02  2010-03  2010-04  2010-05  2010-06  \
0    United States         0      NaN   1200.0   1245.0   1200.0   1250.0
1     New York, NY         1      NaN      NaN      NaN      NaN      NaN
2  Los Angeles, CA         2      NaN   1380.0   1425.0   1420.0   1437.5
3      Chicago, IL         3      NaN   1050.0   1100.0   1050.0   1075.0
4       Dallas, TX         4      NaN      NaN      NaN      NaN      NaN

   2010-07  2010-08  2010-09  ...  2017-11  2017-12  2018-01  2018-02  \
0   1200.0   1200.0   1044.0  ...   1400.0   1400.0   1400.0   1405.0
1      NaN      NaN      NaN  ...   2095.0   2050.0   2050.0   2085.0
2   1550.0   1600.0   1388.0  ...   1961.0   1947.0   1930.0   1950.0
3    995.0    950.0   1090.0  ...   1525.0   1495.0   1477.0   1495.0
4      NaN      NaN   1000.0  ...   1217.0   1205.0   1200.0   1242.0

   2018-03  2018-04  2018-05  2018-06  2018-07  2018-08
0   1425.0   1434.0   1450.0   1450.0   1460.0   1460.0
1   2100.0   2135.0   2150.0   2192.0   2155.0   2153.0
2   1960.0   1950.0   1995.0   2026.0   2050.0   2095.0
3   1495.0   1495.0   1495.0   1495.0   1500.0   1500.0
4   1235.0   1235.0   1252.0   1255.0   1267.0   1266.0

[5 rows x 105 columns]

          RegionName  SizeRank  2010-02  2010-03  2010-04  2010-05  2010-06  \
219       Casper, WY       219      NaN      NaN      NaN      NaN      NaN
220        Minot, ND       220      NaN      NaN      NaN      NaN      NaN
221  Waynesville, MO       221      NaN      NaN      NaN      NaN      NaN
222       Clovis, NM       222      NaN      NaN      NaN      NaN      NaN
223        Rolla, MO       223      NaN      NaN      NaN      NaN      NaN

     2010-07  2010-08  2010-09  ...  2017-11  2017-12  2018-01  2018-02  \
219      NaN      NaN      NaN  ...    662.5    699.5    650.0    650.0
220      NaN      NaN      NaN  ...    495.0    495.0    495.0    495.0
221      NaN      NaN      NaN  ...    535.0    535.0    535.0    530.0
222      NaN      NaN      NaN  ...    550.0    525.0    505.0    505.0
223      NaN      NaN      NaN  ...    422.5    425.0    425.0    442.5

     2018-03  2018-04  2018-05  2018-06  2018-07  2018-08
219    650.0    625.0    675.0    650.0    637.5    599.0
220    495.0    495.0    495.0    535.0    515.0    495.0
221    532.5    535.0    542.5    540.0    535.0    545.0
222    500.0    515.0    525.0    585.0    585.0    515.0
223    450.0    450.0    450.0    450.0    450.0    450.0

[5 rows x 105 columns]
A few things to note.
- There are more regions in the one-bedroom data (i.e., there are more rows).
- Dates are listed as columns.
What do we want?
- One DataFrame with one-bedroom and studio prices.
- A MultiIndex of region and date on the rows (or a column for region and a column for date).
- Columns of studio and one-bedroom prices.
# Drop columns we don't need.
studios = studios.drop('SizeRank', axis=1)
onebeds = onebeds.drop('SizeRank', axis=1)
Make the data long
We could assign indexes and reshape this using `stack()`, or we can use `melt()` and skip the indexing.
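For reference, here is roughly what the `stack()` route would look like, a sketch rather than something we run here (note that `stack()` drops missing values by default, unlike `melt()`):
studios_long = (studios.set_index('RegionName')   # leave only the date columns
                       .stack()                   # one row per region-date pair
                       .reset_index())            # back to a flat DataFrame
studios_long.columns = ['RegionName', 'date', 'price']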
studios = pd.melt(studios, ['RegionName'])   # RegionName is the id variable; the date columns melt into rows
studios.head()
| | RegionName | variable | value |
|---|---|---|---|
0 | United States | 2010-01 | 1300.0 |
1 | New York, NY | 2010-01 | NaN |
2 | Los Angeles, CA | 2010-01 | NaN |
3 | Chicago, IL | 2010-01 | NaN |
4 | Dallas, TX | 2010-01 | NaN |
studios.shape
(17368, 3)
There are 17,368 observations (i.e., region-date pairs) in the studio DataFrame: 167 regions × 104 months.
Rename variables
studios.rename(columns={'RegionName':'region', 'variable':'date', 'value':'studio_price'}, inplace=True)
studios.head()
| | region | date | studio_price |
|---|---|---|---|
0 | United States | 2010-01 | 1300.0 |
1 | New York, NY | 2010-01 | NaN |
2 | Los Angeles, CA | 2010-01 | NaN |
3 | Chicago, IL | 2010-01 | NaN |
4 | Dallas, TX | 2010-01 | NaN |
Repeat for the one-bedroom data
The studio data look great. Get the one-bedroom data set up.
onebeds = pd.melt(onebeds, ['RegionName'])
onebeds.rename(columns={'RegionName':'region', 'variable':'date', 'value':'onebed_price'}, inplace=True)
onebeds.head()
| | region | date | onebed_price |
|---|---|---|---|
0 | United States | 2010-02 | NaN |
1 | New York, NY | 2010-02 | NaN |
2 | Los Angeles, CA | 2010-02 | NaN |
3 | Chicago, IL | 2010-02 | NaN |
4 | Dallas, TX | 2010-02 | NaN |
onebeds.shape
(23072, 3)
There are 23,072 observations (i.e., region-date pairs) in the one-bedroom DataFrame (224 regions × 103 months), so there are more region-date pairs here than in the studio data.
2. Merging DataFrames into One DataFrame
Now that we're done with the wind-up, time to do what this notebook is about: Merging two sets of data.
We want to match the two DataFrames together according to the region-date pairs. In database-ese, the variables we match on are called keys, and the keys here are `region` and `date`. We could match these keys in potentially different ways. For example, there may be several observations for a given key in one set of data but not in the other. What to do then? We have to tell the computer how to join (more database-ese) the two DataFrames. There are four different options for the join:
- inner: keep the intersection of the keys
- left: keep all the keys from the left DataFrame
- right: keep all the keys from right DataFrame
- outer: keep all the keys from both DataFrames
We specify the join type with the `how` parameter. The default is inner, but for sanity's sake, be explicit about your join.
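To make the four options concrete, here is a toy example on two made-up DataFrames (`left_df` and `right_df` are hypothetical, not the Zillow data):
left_df  = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

for join in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left=left_df, right=right_df, on='key', how=join)
    print(join, '->', sorted(merged['key']))

# inner -> ['b', 'c']            the intersection of the keys
# left  -> ['a', 'b', 'c']       all keys from the left DataFrame
# right -> ['b', 'c', 'd']       all keys from the right DataFrame
# outer -> ['a', 'b', 'c', 'd']  all keys from both DataFrames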
Back to our example. We want to merge the one-bedroom and studio DataFrames. Both are organized by `region` and `date`, and these keys form a unique observation in each DataFrame. Let's try joining with `inner`, which implicitly drops any observations that appear in one DataFrame but not the other. Had we matched with `outer`, we would have kept these observations regardless of which DataFrame they reside in. What would `left` and `right` do?
# left and right specify the DataFrames to merge, on specifies the keys (we are using two keys)
rental_prices = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how='inner')
rental_prices.head()
| | region | date | studio_price | onebed_price |
|---|---|---|---|---|
0 | United States | 2010-02 | 1500.0 | NaN |
1 | New York, NY | 2010-02 | 2000.0 | NaN |
2 | Los Angeles, CA | 2010-02 | NaN | NaN |
3 | Chicago, IL | 2010-02 | 1500.0 | NaN |
4 | Dallas, TX | 2010-02 | NaN | NaN |
rental_prices.shape
(12669, 4)
There are only 12,669 region-date pairs in the merged data. That's fewer than in either the studio or the one-bedroom data, because the inner join keeps only the pairs that appear in both.
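An aside: since we claim that region and date uniquely identify an observation in each DataFrame, we can have pandas verify that claim. A sketch using the `validate` parameter of `merge()` (available in reasonably recent pandas versions), which raises an error if the keys are not unique as promised:
# Raises a MergeError if region-date pairs are duplicated in either DataFrame.
rental_prices = pd.merge(left=studios, right=onebeds,
                         on=['region', 'date'], how='inner',
                         validate='one_to_one')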
An Alternative Approach
So far we've been merging DataFrames using the pandas package (i.e., `pd.merge()`), but since DataFrames are pandas objects, you can also merge them using the DataFrame method `.merge()`. For example, we could have done the same `inner` merge with the following code:
rental_prices = studios.merge(onebeds, on=['region', 'date'], how='inner')
rental_prices.head()
| | region | date | studio_price | onebed_price |
|---|---|---|---|---|
0 | United States | 2010-02 | 1500.0 | NaN |
1 | New York, NY | 2010-02 | 2000.0 | NaN |
2 | Los Angeles, CA | 2010-02 | NaN | NaN |
3 | Chicago, IL | 2010-02 | 1500.0 | NaN |
4 | Dallas, TX | 2010-02 | NaN | NaN |
rental_prices.shape
(12669, 4)
The difference is that in taking this approach we're starting with the `studios` DataFrame as our "left," and `onebeds` is implicitly our "right" DataFrame. All other logic regarding how the merge/join works is the same.
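If you want to convince yourself that the two spellings agree, here is a quick check using the `.equals()` method, which tests whether two DataFrames have the same shape and elements:
merged_fn = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how='inner')
merged_method = studios.merge(onebeds, on=['region', 'date'], how='inner')
print(merged_fn.equals(merged_method))   # expect True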
Practice
Try joining by `left`, `right`, and `outer`. What happens to the number of observations?
for join in ['inner', 'left', 'right', 'outer']:
    df = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how=join)
    print(f"'{join}' join yields {df.shape[0]:,d} observations.")
'inner' join yields 12,669 observations.
'left' join yields 17,368 observations.
'right' join yields 23,072 observations.
'outer' join yields 27,771 observations.
The counts hang together: the outer join keeps every key from both sides, so outer = left + right - inner (17,368 + 23,072 - 12,669 = 27,771).
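To see which side the extra observations come from, `merge()` also accepts an `indicator` parameter: with `indicator=True`, the result gains a `_merge` column whose values are 'left_only', 'right_only', or 'both'. A quick sketch:
# Tag each row of an outer join with its origin, then tally the sources.
check = pd.merge(left=studios, right=onebeds, on=['region', 'date'],
                 how='outer', indicator=True)
print(check['_merge'].value_counts())   # 'both' should match the inner-join count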
Set the index to region and date
Back to the case where we're using the 'inner' join. Let's set the index as region and date.
rental_prices = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how='inner')
rental_prices['date'] = pd.to_datetime(rental_prices['date']) # set the date column to datetime objects
rental_prices.set_index(['region', 'date'], inplace=True) # set up the index and sort it!
rental_prices.sort_index(axis=0, inplace=True)
rental_prices.head()
| | | studio_price | onebed_price |
|---|---|---|---|
| region | date | | |
| Abilene, TX | 2010-02-01 | NaN | NaN |
| | 2010-03-01 | NaN | NaN |
| | 2010-04-01 | NaN | NaN |
| | 2010-05-01 | NaN | NaN |
| | 2010-06-01 | NaN | NaN |
rental_prices.loc['United States']
| | studio_price | onebed_price |
|---|---|---|
| date | | |
2010-02-01 | 1500.0 | NaN |
2010-03-01 | 1500.0 | 1200.0 |
2010-04-01 | 1500.0 | 1245.0 |
2010-05-01 | 1500.0 | 1200.0 |
2010-06-01 | 1500.0 | 1250.0 |
... | ... | ... |
2018-04-01 | 1520.0 | 1434.0 |
2018-05-01 | 1550.0 | 1450.0 |
2018-06-01 | 1550.0 | 1450.0 |
2018-07-01 | 1550.0 | 1460.0 |
2018-08-01 | 1550.0 | 1460.0 |
103 rows × 2 columns
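With a sorted MultiIndex we can also slice the other way. A sketch using the `.xs()` method, which pulls a cross-section at a given index level, here every region's prices for a single month:
# Cross-section: all regions for one month.
aug_2018 = rental_prices.xs(dt.datetime(2018, 8, 1), level='date')
print(aug_2018.head())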
Try an outer merge
Notice that 2010-01 is missing, even though the studio data have an entry for it. (The one-bedroom data don't start until 2010-02, so the inner join dropped it.) Let's try an outer join.
# The only difference from the previous merge is the switch to 'outer'
rental_prices = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how='outer')
rental_prices['date'] = pd.to_datetime(rental_prices['date'])
rental_prices.set_index(['region', 'date'], inplace=True)
rental_prices.sort_index(axis=0, inplace=True)
rental_prices.loc['United States']
| | studio_price | onebed_price |
|---|---|---|
| date | | |
2010-01-01 | 1300.0 | NaN |
2010-02-01 | 1500.0 | NaN |
2010-03-01 | 1500.0 | 1200.0 |
2010-04-01 | 1500.0 | 1245.0 |
2010-05-01 | 1500.0 | 1200.0 |
... | ... | ... |
2018-04-01 | 1520.0 | 1434.0 |
2018-05-01 | 1550.0 | 1450.0 |
2018-06-01 | 1550.0 | 1450.0 |
2018-07-01 | 1550.0 | 1460.0 |
2018-08-01 | 1550.0 | 1460.0 |
104 rows × 2 columns
That makes more sense for us.
# Peel off the data for Atlanta
ga = rental_prices.loc['Atlanta, GA']
avg_studio = ga['studio_price'].mean()
avg_onebed = ga['onebed_price'].mean()
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(ga.index, ga['studio_price'], color='blue')
ax.plot(ga.index, ga['onebed_price'], color='red')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Median Rental Prices: Atlanta, GA',fontsize=24)
ax.text(dt.datetime(2016,7,10), 1320, 'Studio \n($'+ f'{round(avg_studio,2):,}' +')' ,ha='center',fontsize=18,color='blue')
ax.text(dt.datetime(2017,10,10), 1110, 'One Bedroom\n($'+ f'{round(avg_onebed,2):,}' +')' ,ha='center',fontsize=18,color='red')
plt.show()
It's interesting that studio rental prices exceed one-bedroom prices after 2015. I wonder why? Note that I put the average median rental price for each line in the label using a different-looking format syntax. BTW, I like putting summary statistics in labels. I think it's a nice way to convey information.
Let's also take a look at Florida. Maybe the trends are different.
miami = rental_prices.loc['Miami, FL']
tampa = rental_prices.loc['Tampa, FL']
avg_studio = miami['studio_price'].mean()
avg_onebed = miami['onebed_price'].mean()
fig, ax = plt.subplots(1, 2, figsize=(15,8))
ax[0].plot(miami.index, miami['studio_price'], color='blue')
ax[0].plot(miami.index, miami['onebed_price'], color='red')
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[0].set_ylim(400,2200)
ax[0].set_title('Median Rental Prices: Miami, FL', fontsize=18)
ax[0].text(dt.datetime(2015,12,31), 2000, 'Studio \n($'+ f'{round(avg_studio,2):,}' +')' ,ha='center',fontsize=14,color='blue')
ax[0].text(dt.datetime(2016,12,31), 1350, 'One Bedroom\n($'+ f'{round(avg_onebed,2):,}' +')' ,ha='center',fontsize=14,color='red')
ax[1].plot(tampa.index, tampa['studio_price'], color='blue')
ax[1].plot(tampa.index, tampa['onebed_price'], color='red')
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
ax[1].set_ylim(400,2200)
avg_studio = tampa['studio_price'].mean()
avg_onebed = tampa['onebed_price'].mean()
ax[1].set_title('Median Rental Prices: Tampa, FL',fontsize=18)
ax[1].text(dt.datetime(2015,9,10), 1450, 'Studio \n($'+ f'{round(avg_studio,2):,}' +')' ,ha='center',fontsize=14,color='blue')
ax[1].text(dt.datetime(2016,3,1), 700, 'One Bedroom\n($'+ f'{round(avg_onebed,2):,}' +')' ,ha='center',fontsize=14,color='red')
plt.show()
Interesting...in Florida, one-bedroom apartments are always cheaper than studios, whereas in Atlanta the two crossed somewhere in 2015. Maybe this reflects studios being located in more desirable neighborhoods?
Practice
I downloaded data from MovieLens. There are five files in your data folder.
- Load the `movies.csv` and the `ratings.csv` files as DataFrames. The movies file lists movie names and genres, while the ratings file includes ratings (think Amazon "stars") by different viewers. NB, both files are in a sub-folder cleverly called "MovieLens."
movies = pd.read_csv('./Data/MovieLens/movies.csv')
movies.head()
| | movieId | title | genres |
|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
ratings = pd.read_csv('./Data/MovieLens/ratings.csv')
ratings.head()
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
0 | 1 | 1 | 4.0 | 964982703 |
1 | 1 | 3 | 4.0 | 964981247 |
2 | 1 | 6 | 4.0 | 964982224 |
3 | 1 | 47 | 5.0 | 964983815 |
4 | 1 | 50 | 5.0 | 964982931 |
- What size are the two DataFrames?
print(movies.shape)
print(ratings.shape)
(9742, 3)
(100836, 4)
- There are many ratings per movie. Merge the two files on the movieId. What kind of join should you use?
# Each movie shows up several times in the ratings file (each movie is reviewed by many raters)
# We use a left join to keep all the observations in the ratings file (an inner would work as well)
data = pd.merge(left=ratings, right=movies, on='movieId', how='left')
data.head()
| | userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|
0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 1 | 3 | 4.0 | 964981247 | Grumpier Old Men (1995) | Comedy|Romance |
2 | 1 | 6 | 4.0 | 964982224 | Heat (1995) | Action|Crime|Thriller |
3 | 1 | 47 | 5.0 | 964983815 | Seven (a.k.a. Se7en) (1995) | Mystery|Thriller |
4 | 1 | 50 | 5.0 | 964982931 | Usual Suspects, The (1995) | Crime|Mystery|Thriller |
- What size is the resulting DataFrame?
print(data.shape)
(100836, 6)
- Collapse the data so you have the median rating for each movie.
data = data.groupby(['movieId','title','genres'],as_index=False).agg({'rating':'median'})
data.head()
| | movieId | title | genres | rating |
|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 4.0 |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.5 |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 3.0 |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | 3.0 |
4 | 5 | Father of the Bride Part II (1995) | Comedy | 3.0 |
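A side note: if you also want to know how many ratings stand behind each median, `agg()` accepts a list of statistics. A sketch using the original `ratings` DataFrame:
# Median rating and the number of ratings per movie, in one pass.
stats = ratings.groupby('movieId')['rating'].agg(['median', 'count'])
print(stats.head())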
- Create a histogram of the ratings (i.e., `ax.hist()`). Use 10 bins. Use the `density=True` parameter to plot the density rather than the counts.
fig, ax = plt.subplots(figsize=(15,5))
ax.hist(data['rating'], bins=10, color = 'blue', alpha = 0.25, density=True)
ax.set_title('MovieLens Ratings')
ax.set_xlabel('Rating')
ax.set_ylabel('Density')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Load the `tags.csv` file. It's another file from MovieLens. These are topics or themes covered in each movie, so each movie will contain many tags.
tags = pd.read_csv('./Data/MovieLens/tags.csv')
tags.drop(['userId','timestamp'],axis=1,inplace=True) # drop userid and timestamp
tags.drop_duplicates(subset=['movieId','tag'], keep='first', inplace=True) # we haven't covered this function.
# I'm keeping only the first occurrence
# of each tag for each movie.
tags.head()
| | movieId | tag |
|---|---|---|
0 | 60756 | funny |
1 | 60756 | Highly quotable |
2 | 60756 | will ferrell |
3 | 89774 | Boxing story |
4 | 89774 | MMA |
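Since `drop_duplicates()` is new to us, here is a minimal illustration on a made-up DataFrame (hypothetical data, just to show the behavior):
demo = pd.DataFrame({'movieId': [1, 1, 2],
                     'tag': ['funny', 'funny', 'scary']})
print(demo.drop_duplicates(subset=['movieId', 'tag'], keep='first'))
# The duplicated (1, 'funny') row is dropped; one row per unique movieId-tag pair remains.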
- Merge it into your MovieLens data. What keys should you use? Hint: Look at the column names. What type of join?
df = pd.merge(left=data, right=tags, on=['movieId'], how='left')
df.head()
| | movieId | title | genres | rating | tag |
|---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 4.0 | pixar |
1 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 4.0 | fun |
2 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.5 | fantasy |
3 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.5 | magic board game |
4 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.5 | Robin Williams |
Let's get a look at the kinds of tags we have. Try the code below, which introduces us to the `value_counts()` method.
print(df['tag'].value_counts())
In Netflix queue    119
atmospheric          32
superhero            22
religion             22
Disney               22
                   ...
free speech           1
south park            1
Trey Parker           1
best comedy           1
remaster              1
Name: tag, Length: 1584, dtype: int64
I don't know what 'atmospheric' means -- sounds fancy -- while 'Heroic Bloodshed' sounds terrifying. That tag appears only once, applied to John Wick. Makes sense, I guess. Let's take a look at the tag 'superhero'.
- Compute the average rating for all movies and for movies tagged 'superhero'.
superheros = df[df['tag']=='superhero']
super_mean = superheros['rating'].mean()
all_mean = df['rating'].mean()
print(f'The average rating for superhero movies is {super_mean:4.2f} compared to {all_mean:4.2f} for all movies.')
The average rating for superhero movies is 3.24 compared to 3.40 for all movies.
- Let's see how the ratings of superhero movies compare to all movies. We will plot two histograms on the same axes.
  - Plot the histogram of movies tagged 'superhero'. Use 10 bins. Make the plot blue. Set `alpha=0.25`.
  - Plot the histogram of all movies. Use 10 bins. Make the plot red. Set `alpha=0.25`.
  - Add a legend: `plt.legend()`. Put it in a spot that makes sense using the `loc=` parameter of `plt.legend()`.
  - Add the mean ratings for all movies and for superhero movies as text to the histogram: `ax.annotate()`.
  - Get rid of the top and right spines: `ax.spines['top'].set_visible(False)`.
  - Add axis labels and a chart title that make sense.
fig, ax = plt.subplots(figsize=(15,5))
# Two histograms, one atop the other
ax.hist(superheros['rating'], bins=10, color = 'blue', alpha = 0.25, density=True, label = 'Tagged superhero')
ax.hist(data['rating'], bins=10, color = 'red', alpha = 0.25, density=True, label = 'All movies')
ax.set_title('MovieLens Ratings',size=26,pad=30) # "pad" adds space under the title
ax.set_xlabel('Rating',size=20)
ax.set_ylabel('Density\n',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.text( 0.5, 0.9, f'Mean rating for all movies: {all_mean:4.2f}',fontsize=18)
ax.text( 0.5, 0.8, f'Mean rating for superhero movies: {super_mean:4.2f}',fontsize=18)
plt.xticks(size=18)
plt.yticks(size=18)
plt.legend(frameon=False,loc='upper left',fontsize=20)
plt.show()