Lecture 14: Cleaning and transforming data [SUGGESTED SOLUTIONS]
We now know how to create some complicated datasets. In this notebook we will continue practicing techniques for manipulating data into forms that are amenable to analysis. We will also learn about the American Time Use Survey, which is a really cool dataset.
We begin, as usual, by importing the packages we need.
import pandas as pd # pandas for data handling
import matplotlib.pyplot as plt # matplotlib for plots
import numpy as np # numpy for numerical methods
1. American Time Use Survey (ATUS)
The Bureau of Labor Statistics oversees the American Time Use Survey, which asks a sample of Americans to complete detailed diaries keeping track of each minute of their day.
Follow this link https://www.bls.gov/tus/datafiles_2017.htm to the page for the 2017 survey. Download the 'ATUS 2017 Activity summary file (zip)' located in the 2017 Basic ATUS Data Files section of the page. Alternatively, download it directly: https://www.bls.gov/tus/special.requests/atussum_2017.zip.
Unzip the file. We are looking for atussum_2017.dat. It is a comma-separated file. Let's get it loaded.
atus = pd.read_csv('./Data/atussum_2017/atussum_2017.dat')
atus.head()
TUCASEID | TUFINLWGT | TRYHHCHILD | TEAGE | TESEX | PEEDUCA | PTDTRACE | PEHSPNON | GTMETSTA | TELFS | ... | t181601 | t181699 | t181801 | t181899 | t189999 | t500101 | t500103 | t500105 | t500106 | t500107 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20170101170002 | 5.217551e+06 | 15 | 34 | 2 | 39 | 2 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 20170101170012 | 3.106847e+06 | 7 | 28 | 2 | 40 | 1 | 2 | 1 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 20170101170027 | 1.653409e+07 | 15 | 15 | 1 | 35 | 3 | 2 | 1 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 20170101170030 | 1.757071e+07 | -1 | 46 | 1 | 39 | 1 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 20170101170033 | 1.022275e+06 | -1 | 85 | 1 | 44 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 421 columns
421 variables! That's too many for us today. Let's just keep a few.
col_list = atus.columns.tolist() # .columns returns an Index object; convert it to a list for easier printing.
# Print the first and last 20 variables.
print(col_list[0:20],'\n')
print(col_list[-20:])
['TUCASEID', 'TUFINLWGT', 'TRYHHCHILD', 'TEAGE', 'TESEX', 'PEEDUCA', 'PTDTRACE', 'PEHSPNON', 'GTMETSTA', 'TELFS', 'TEMJOT', 'TRDPFTPT', 'TESCHENR', 'TESCHLVL', 'TRSPPRES', 'TESPEMPNOT', 'TRERNWA', 'TRCHILDNUM', 'TRSPFTPT', 'TEHRUSLT']

['t181204', 't181205', 't181299', 't181301', 't181302', 't181399', 't181401', 't181499', 't181501', 't181599', 't181601', 't181699', 't181801', 't181899', 't189999', 't500101', 't500103', 't500105', 't500106', 't500107']
The demographic variables are all uppercase letters. The time variables are of the form 'txxyyzz' where xx is the major category code, yy is the second-tier code, and zz is the third-tier code. (docs)
Let's keep some demo data and some data about working and sleeping.
vars_to_keep = ['TEAGE', 'TESEX', 'PTDTRACE', 'PEEDUCA', 'GTMETSTA', 'TELFS',
'TUDIARYDAY', 't050101', 't050102', 't050201', 't010101', 't010102']
atus_small = atus[vars_to_keep]
atus_small.head()
TEAGE | TESEX | PTDTRACE | PEEDUCA | GTMETSTA | TELFS | TUDIARYDAY | t050101 | t050102 | t050201 | t010101 | t010102 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | 2 | 2 | 39 | 1 | 1 | 1 | 450 | 0 | 0 | 728 | 0 |
1 | 28 | 2 | 1 | 40 | 1 | 5 | 7 | 0 | 0 | 0 | 385 | 0 |
2 | 15 | 1 | 3 | 35 | 1 | 5 | 4 | 0 | 0 | 0 | 570 | 0 |
3 | 46 | 1 | 1 | 39 | 1 | 1 | 2 | 480 | 0 | 0 | 525 | 0 |
4 | 85 | 1 | 1 | 44 | 1 | 1 | 7 | 0 | 0 | 0 | 756 | 0 |
Let's give the columns some more reasonable names.
# See https://www.bls.gov/tus/freqvariables.pdf for frequently used variables
atus_small = atus_small.rename(columns={'TEAGE':'age', 'TESEX':'gender', 'PTDTRACE':'race',
'PEEDUCA':'edu', 'GTMETSTA':'metro', 'TELFS':'employ', 'TUDIARYDAY':'day'})
atus_small.head()
age | gender | race | edu | metro | employ | day | t050101 | t050102 | t050201 | t010101 | t010102 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | 2 | 2 | 39 | 1 | 1 | 1 | 450 | 0 | 0 | 728 | 0 |
1 | 28 | 2 | 1 | 40 | 1 | 5 | 7 | 0 | 0 | 0 | 385 | 0 |
2 | 15 | 1 | 3 | 35 | 1 | 5 | 4 | 0 | 0 | 0 | 570 | 0 |
3 | 46 | 1 | 1 | 39 | 1 | 1 | 2 | 480 | 0 | 0 | 525 | 0 |
4 | 85 | 1 | 1 | 44 | 1 | 1 | 7 | 0 | 0 | 0 | 756 | 0 |
# See https://www.bls.gov/tus/lexiconnoex2017.pdf for definitions
atus_small = atus_small.rename(columns={'t050101':'work_main', 't050102':'work_other', 't010101':'sleep',
't050201':'work_soc', 't010102':'no_sleep'})
Okay, we have variable names that we can live with. Let's start transforming the data to get it ready for analysis.
A word on loops
We have seen many places where a loop saved us time and effort. It might seem natural to use a loop to perform an operation on each element of a column or a DataFrame.
In general, however, we want to avoid this. Instead, we have used pandas vectorized operations such as
x['new_var'] = x['var_1'] / x['var_2']
to perform element-wise division. Using pandas' native operations is much (much much) faster than looping over the rows of a DataFrame. A lot of optimization has been written into the native functions that is not there when we loop ourselves.
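To get a feel for the difference, here is a quick timing sketch on made-up data (a toy example of our own, not part of the lecture data; the exact numbers will vary by machine, but the vectorized version is typically orders of magnitude faster).

import time

toy = pd.DataFrame({'var_1': range(100_000), 'var_2': range(1, 100_001)})

start = time.time()
toy['loop'] = [a / b for a, b in zip(toy['var_1'], toy['var_2'])] # explicit python loop
loop_time = time.time() - start

start = time.time()
toy['vec'] = toy['var_1'] / toy['var_2'] # vectorized pandas division
vec_time = time.time() - start

print(f'loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s')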
Fortunately, pandas provides methods that let us apply very complex and very general operations to a DataFrame without resorting to a loop. We consider a few of these below.
replace( )
The gender variable is coded 1 for male and 2 for female. I do not want to have to remember that!
The .replace() method replaces one value with another. One syntax is
atus_small['gender'] = atus_small['gender'].replace(1, 'male')
but a more powerful one passes a dict or a list.
atus_small['gender'] = atus_small['gender'].replace({1:'male', 2:'female'})
gender_codes = {1:'male', 2:'female'}
atus_small['gender'] = atus_small['gender'].replace(gender_codes)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | 2 | 39 | 1 | 1 | 1 | 450 | 0 | 0 | 728 | 0 |
1 | 28 | female | 1 | 40 | 1 | 5 | 7 | 0 | 0 | 0 | 385 | 0 |
2 | 15 | male | 3 | 35 | 1 | 5 | 4 | 0 | 0 | 0 | 570 | 0 |
3 | 46 | male | 1 | 39 | 1 | 1 | 2 | 480 | 0 | 0 | 525 | 0 |
4 | 85 | male | 1 | 44 | 1 | 1 | 7 | 0 | 0 | 0 | 756 | 0 |
unique( )
Let's code race, too. What codes are in our data? The method .unique() returns all the unique values.
races = atus_small['race'].unique()
races.sort()
print(races)
[ 1 2 3 4 5 6 7 8 9 10 11 13 16 18]
# I looked up the codes in the documentation...
# https://www.bls.gov/tus/atuscpscodebk0314.pdf
race_codes = {1:'white', 2:'black', 3:'native_am', 4:'asian', 5:'hawaiian', 6:'wh_bl',
7:'wh_na', 8:'wh_as', 9:'wh_ha', 10:'bl_na', 11:'bl_as', 13:'na_as', 16:'wh_bl_as', 18:'wh_as_ha'}
atus_small['race'] = atus_small['race'].replace(race_codes)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | 39 | 1 | 1 | 1 | 450 | 0 | 0 | 728 | 0 |
1 | 28 | female | white | 40 | 1 | 5 | 7 | 0 | 0 | 0 | 385 | 0 |
2 | 15 | male | native_am | 35 | 1 | 5 | 4 | 0 | 0 | 0 | 570 | 0 |
3 | 46 | male | white | 39 | 1 | 1 | 2 | 480 | 0 | 0 | 525 | 0 |
4 | 85 | male | white | 44 | 1 | 1 | 7 | 0 | 0 | 0 | 756 | 0 |
Apply a function to a column: map()
We can apply functions to columns, too. The .map() method handles this for us. It applies the given function to each element of the column. The function can be built-in or user-defined. This is super powerful: we can write a function that performs a complicated transformation and apply it to each element of a column in one simple line.
Let's define a function that converts minutes to hours.
def minutes_to_hours(x):
return x/60
Now apply .map() to the work_main column.

And yes, this example is a bit contrived (i.e., stupid) since we could have just as easily done atus_small['work_main']/60. But that wouldn't have given us practice with .map()! I guess the glass-half-full perspective is that we could do both and compare.
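Taking that suggestion, here is a minimal sanity check (our own addition) that the two approaches agree. We work on a copy so the real column is not converted twice.

# Compare map() against direct division on a copy of the column.
check = atus_small['work_main'].copy()
print(check.map(minutes_to_hours).equals(check / 60)) # should print True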
atus_small['work_main'] = atus_small['work_main'].map(minutes_to_hours)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | 39 | 1 | 1 | 1 | 7.5 | 0 | 0 | 728 | 0 |
1 | 28 | female | white | 40 | 1 | 5 | 7 | 0.0 | 0 | 0 | 385 | 0 |
2 | 15 | male | native_am | 35 | 1 | 5 | 4 | 0.0 | 0 | 0 | 570 | 0 |
3 | 46 | male | white | 39 | 1 | 1 | 2 | 8.0 | 0 | 0 | 525 | 0 |
4 | 85 | male | white | 44 | 1 | 1 | 7 | 0.0 | 0 | 0 | 756 | 0 |
Apply a function to a DataFrame: applymap( )
If we want to apply the same function to several columns of a DataFrame (rather than a Series, as we did above), we use .applymap(). It works the same way, applying the function to each element.
# We can map to several columns at once.
atus_small[['work_other', 'sleep', 'no_sleep']] = atus_small[['work_other', 'sleep', 'no_sleep']].applymap(minutes_to_hours)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | 39 | 1 | 1 | 1 | 7.5 | 0.0 | 0 | 12.133333 | 0.0 |
1 | 28 | female | white | 40 | 1 | 5 | 7 | 0.0 | 0.0 | 0 | 6.416667 | 0.0 |
2 | 15 | male | native_am | 35 | 1 | 5 | 4 | 0.0 | 0.0 | 0 | 9.500000 | 0.0 |
3 | 46 | male | white | 39 | 1 | 1 | 2 | 8.0 | 0.0 | 0 | 8.750000 | 0.0 |
4 | 85 | male | white | 44 | 1 | 1 | 7 | 0.0 | 0.0 | 0 | 12.600000 | 0.0 |
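One version note: in newer pandas releases (2.1 and later), applymap( ) is deprecated and the same element-wise operation is spelled DataFrame.map( ). If your pandas complains, the equivalent call would look something like this:

# pandas >= 2.1 spelling (left commented out so we do not convert the columns twice):
# cols = ['work_other', 'sleep', 'no_sleep']
# atus_small[cols] = atus_small[cols].map(minutes_to_hours)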
Practice
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- 'edu' holds the highest level of education obtained. It can take values between 31 and 46. Convert the numeric values to labels.
  - codes < 39 are 'less than high'
  - 39 = 'high school'
  - 40 = 'some college'
  - 41 & 42 are 'associate'
  - 43 = 'bachelor'
  - 44 = 'master'
  - 45 = 'prof'
  - 46 = 'phd'
educ_codes = {39:'high school', 40:'some college', 41:'associate', 42:'associate',
43: 'bachelor', 44: 'master', 45:'prof', 46:'phd'}
atus_small.loc[ atus_small['edu']<39, 'edu'] = 'less than high'
atus_small['edu'] = atus_small['edu'].replace(educ_codes)
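Note that the order of those two lines matters: the .loc[] mask compares 'edu' against 39, which only works while the column is still numeric, so it has to run before replace( ) fills in strings. A one-pass alternative (just a sketch, not the official solution) would map every code through a small function:

# Sketch of an equivalent one-pass approach (commented out; 'edu' is already converted):
# def edu_label(code):
#     return 'less than high' if code < 39 else educ_codes[code]
# atus_small['edu'] = atus_small['edu'].map(edu_label)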
- Make sure you covered all your bases. Print out a list of the unique values from the column 'edu'.
atus_small['edu'].unique()
atus_small.loc[10218,'edu']
'associate'
- Apply the minutes_to_hours function to the work_soc variable. This variable is the time spent 'Socializing, relaxing, and leisure as part of job'. I have no idea what kind of jobs make you relax and take leisure.
atus_small['work_soc'] = atus_small['work_soc'].map(minutes_to_hours)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.5 | 0.0 | 0.0 | 12.133333 | 0.0 |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.0 | 0.0 | 0.0 | 6.416667 | 0.0 |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.0 | 0.0 | 0.0 | 9.500000 | 0.0 |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.0 | 0.0 | 0.0 | 8.750000 | 0.0 |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.0 | 0.0 | 0.0 | 12.600000 | 0.0 |
- Create a column named 'work' that is the sum of 'work_main' and 'work_other'.
atus_small['work'] = atus_small['work_main'] + atus_small['work_other']
- Create a histogram of 'work' for only those observations with work>0. Use 40 bins, make the bars red, and set alpha=0.4.
fig, ax = plt.subplots(figsize=(15,10))
ax.hist(atus_small.loc[atus_small['work']>0, 'work'], bins=40, alpha = 0.4, color = 'red')
ax.set_xlabel('hours of work')
ax.set_title('Distribution of hours worked, conditional on working at least one minute')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Binning
Often we would like to take a continuous variable and make it discrete. One way to do this is to place the data into 'bins' or 'buckets'. (This is implicitly being done when we create a histogram.) Pandas has a nice facility for this. In fact, we've seen these (cut and qcut) before.
# What range are the ages?
atus_small['age'].describe()
count    10223.000000
mean        49.876846
std         18.008395
min         15.000000
25%         35.000000
50%         50.000000
75%         64.000000
max         85.000000
Name: age, dtype: float64
# 15 to 85 years
bins = [15, 25, 35, 45, 55, 65, 75] # These are the endpoints for the bins
atus_small['age_bin'] = pd.cut(atus_small['age'], bins)
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | (25.0, 35.0] |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | (25.0, 35.0] |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | NaN |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | (45.0, 55.0] |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | NaN |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | (55.0, 65.0] |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | (35.0, 45.0] |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | (45.0, 55.0] |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | (55.0, 65.0] |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | (65.0, 75.0] |
Notice the NaN values for observations 2 and 4. Observation 2 is 15 years old: our lowest bin starts at 15, but the interval is open on the left endpoint --- 15 is not included in the bin. We could set the bottom endpoint to 14 to catch this observation. Observation 4 is 85 years old, above our largest endpoint of 75, so it falls outside every bin.
bins = [14, 25, 35, 45, 55, 65, 75] # These are the endpoints for the bins
atus_small['age_bin'] = pd.cut(atus_small['age'], bins)
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | (25.0, 35.0] |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | (25.0, 35.0] |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | (14.0, 25.0] |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | (45.0, 55.0] |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | NaN |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | (55.0, 65.0] |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | (35.0, 45.0] |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | (45.0, 55.0] |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | (55.0, 65.0] |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | (65.0, 75.0] |
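An alternative to moving the bottom endpoint (an aside of our own) is cut( )'s include_lowest=True flag, which closes the first interval on the left so that age 15 is captured:

# include_lowest=True makes the first interval include its left edge (15):
pd.cut(atus_small['age'], [15, 25, 35, 45, 55, 65, 75], include_lowest=True).head()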
We could also make the bins closed on the left and open on the right by passing the right=False parameter to cut( ).
bins = [15, 25, 35, 45, 55, 65, 75] # These are the endpoints for the bins
atus_small['age_bin'] = pd.cut(atus_small['age'], bins, right=False)
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | [25.0, 35.0) |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | [25.0, 35.0) |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | [15.0, 25.0) |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | [45.0, 55.0) |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | NaN |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | [55.0, 65.0) |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | [35.0, 45.0) |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | [45.0, 55.0) |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | [55.0, 65.0) |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | [65.0, 75.0) |
We know from before that cut( ) can create the bins for us, too. Pass cut( ) an integer and get back that many equal-width bins. The precision parameter determines how many decimal places are used in the bin edges.
atus_small['age_bin'] = pd.cut(atus_small['age'], 3, precision = 0)
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | (15.0, 38.0] |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | (15.0, 38.0] |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | (15.0, 38.0] |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | (38.0, 62.0] |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | (62.0, 85.0] |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | (38.0, 62.0] |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | (38.0, 62.0] |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | (38.0, 62.0] |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | (62.0, 85.0] |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | (62.0, 85.0] |
pd.value_counts(atus_small['age_bin']) # How many observations in each bin?
(38.0, 62.0]    4033
(15.0, 38.0]    3176
(62.0, 85.0]    3014
Name: age_bin, dtype: int64
Quantiles
Notice that the bins are equally spaced from the smallest to the largest values of age; the number of observations in each is not equal. The qcut( ) function instead splits the data according to the sample quantiles. This should get you bins with approximately equal numbers of observations.
atus_small['age_bin'] = pd.qcut(atus_small['age'], 4) # create quartiles based on age
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | (14.999, 35.0] |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | (14.999, 35.0] |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | (14.999, 35.0] |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | (35.0, 50.0] |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | (64.0, 85.0] |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | (50.0, 64.0] |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | (35.0, 50.0] |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | (35.0, 50.0] |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | (50.0, 64.0] |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | (64.0, 85.0] |
pd.value_counts(atus_small['age_bin']) # How many observations in each bin?
(35.0, 50.0]      2644
(14.999, 35.0]    2560
(50.0, 64.0]      2517
(64.0, 85.0]      2502
Name: age_bin, dtype: int64
String methods
We have seen some of these before when we discussed strings. These are analogous to the string methods but set up for DataFrames. These vectorized string methods are optimized to work over an entire column. The method call looks like

data['var'].str.method()

where method( ) is the method we are applying. A list of vectorized string methods is on page 218 of McKinney. Below, we try a few out.
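Before turning to real data, here is a tiny illustration on a throwaway Series (an example of our own):

s = pd.Series([' Apple ', 'banana', 'CHERRY'])
print(s.str.strip()) # remove leading/trailing whitespace
print(s.str.lower()) # lowercase each element
print(s.str.len())   # length of each string (the spaces in ' Apple ' count)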
movies = pd.read_csv('./Data/MovieLens/movies.csv')
movies.head()
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
str.contains( )
The genres are mixed together. Let's get all the comedies. The .contains() method returns a bool Series with True for observations in which the string contains the search term.
movies['genres'].str.contains('Comedy')
0        True
1       False
2        True
3        True
4        True
        ...
9737     True
9738     True
9739    False
9740    False
9741     True
Name: genres, Length: 9742, dtype: bool
comedies = movies[movies['genres'].str.contains('Comedy')]
comedies.head()
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
6 | 7 | Sabrina (1995) | Comedy|Romance |
str.split( )
This method splits the string at the delimiter passed to split( ). It returns a list of the chunks that fall between the delimiters.
# The movie genres are separated with the '|' character.
movies['genres'].str.split('|')
0       [Adventure, Animation, Children, Comedy, Fantasy]
1       [Adventure, Children, Fantasy]
2       [Comedy, Romance]
3       [Comedy, Drama, Romance]
4       [Comedy]
        ...
9737    [Action, Animation, Comedy, Fantasy]
9738    [Animation, Comedy, Fantasy]
9739    [Drama]
9740    [Action, Animation]
9741    [Comedy]
Name: genres, Length: 9742, dtype: object
This could be useful for processing name data that come in the form last,first or city,state.
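For instance, with a made-up 'last,first' column (hypothetical data, not part of MovieLens), passing expand=True spreads the pieces into separate DataFrame columns:

# Hypothetical example: split 'last,first' strings into two columns.
names = pd.Series(['Doe,John', 'Roe,Jane'])
print(names.str.split(',', expand=True)) # column 0 = last, column 1 = first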
# Remember, DataFrames can have columns of lists...
movies['genre_split'] = movies['genres'].str.split('|')
movies.head(10)
movieId | title | genres | genre_split | |
---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | [Adventure, Children, Fantasy] |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | [Comedy, Romance] |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | [Comedy, Drama, Romance] |
4 | 5 | Father of the Bride Part II (1995) | Comedy | [Comedy] |
5 | 6 | Heat (1995) | Action|Crime|Thriller | [Action, Crime, Thriller] |
6 | 7 | Sabrina (1995) | Comedy|Romance | [Comedy, Romance] |
7 | 8 | Tom and Huck (1995) | Adventure|Children | [Adventure, Children] |
8 | 9 | Sudden Death (1995) | Action | [Action] |
9 | 10 | GoldenEye (1995) | Action|Adventure|Thriller | [Action, Adventure, Thriller] |
str.join( )
Put the pieces of each list back together, separated by a delimiter of your choosing. I end up using this method quite a bit.
movies['genre_split'].str.join('::')
0       Adventure::Animation::Children::Comedy::Fantasy
1       Adventure::Children::Fantasy
2       Comedy::Romance
3       Comedy::Drama::Romance
4       Comedy
        ...
9737    Action::Animation::Comedy::Fantasy
9738    Animation::Comedy::Fantasy
9739    Drama
9740    Action::Animation
9741    Comedy
Name: genre_split, Length: 9742, dtype: object
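As a quick sanity check (our own addition), splitting and rejoining with the original delimiter should reproduce the genres column exactly:

# Round trip: split on '|' then join with '|' recovers the original strings.
print(movies['genre_split'].str.join('|').equals(movies['genres'])) # should print True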
Practice
What era had the best movies? Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- Create a copy of the 'movies' DataFrame named 'movies_prac'. Work with 'movies_prac' for these exercises.
movies_prac = movies.copy() # the .copy() method creates another object
- Get rid of any leading or trailing white space in the 'title' variable by using the .str.strip() method.
movies_prac['title'] = movies_prac['title'].str.strip()
- Extract the four-digit year from the titles and put them into a new column named 'year'. You might try str.slice().
movies_prac['year'] = movies_prac['title'].str.slice(-5, -1) # The negative notation counts backwards.
# I am relying on the fact that each title ends in '(year)'.
# Not the most robust method.
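A sturdier approach (optional, not the graded solution) is a regular expression with str.extract( ), which grabs a parenthesized four-digit year wherever it appears in the title:

# Regex alternative (left commented out to keep the slice-based solution flow intact):
# movies_prac['year'] = movies_prac['title'].str.extract(r'\((\d{4})\)', expand=False)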
- There are 12 movies that do not have a year. Assign "-1" to those. I suggest using .loc[] and creating a boolean vector that identifies these observations using ~movies_prac['year'].str.isdigit()
movies_prac.loc[~movies_prac['year'].str.isdigit(),'year'] = -1 # The ~ is the boolean for "not"
# movies_prac['year'].str.isdigit() creates a series of TRUE and FALSE.
# ~movies_prac['year'].str.isdigit() picks out the entries with FALSE.
- Drop the movies without years.
movies_prac = movies_prac[ movies_prac['year']!=-1 ]
- Convert the year data to int.
movies_prac['year'] = movies_prac['year'].astype(int)
- 'Bin' the years into decades using .cut() with pre-defined bin edges: decades = [1899, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020]. Create a column called 'decade' to store the binned values.
decades = [1899, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020]
movies_prac['decade'] = pd.cut(movies_prac['year'], decades)
movies_prac.head()
movieId | title | genres | genre_split | year | decade | |
---|---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | [Adventure, Children, Fantasy] | 1995 | (1990, 2000] |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | [Comedy, Romance] | 1995 | (1990, 2000] |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | [Comedy, Drama, Romance] | 1995 | (1990, 2000] |
4 | 5 | Father of the Bride Part II (1995) | Comedy | [Comedy] | 1995 | (1990, 2000] |
- Load 'ratings.csv' from the MovieLens data and merge it to movies_prac to create a new DataFrame named movies_2. Use an inner merge with 'movieId' as the key.
rate = pd.read_csv('./Data/MovieLens/ratings.csv')
movies_2 = pd.merge(left=movies_prac, right=rate, on='movieId', how='inner')
movies_2.head()
movieId | title | genres | genre_split | year | decade | userId | rating | timestamp | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 1 | 4.0 | 964982703 |
1 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 5 | 4.0 | 847434962 |
2 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 7 | 4.5 | 1106635946 |
3 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 15 | 2.5 | 1510577970 |
4 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 17 | 4.5 | 1305696483 |
- Run the following code. What does it do?

means = movies_2.groupby('decade')['rating'].mean()

It groups the movies by decade and computes the mean rating within each decade, storing the result in a Series named means indexed by decade.
- Time to create a graphic to answer our question: which era had the best movies? Create a horizontal bar chart of the average movie ratings. The index of means is an interval object. To assign it to the y-axis labels, convert it to a string: ax.barh(means.index.astype(str), means)
fig, ax = plt.subplots(figsize=(10,5))
ax.barh(means.index.astype(str), means, color = 'red', alpha = 0.25 )
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Average movie rating by decade')
ax.grid(axis = 'x', color='white')
plt.show()