Lecture 8: Reading \& writing data with Pandas [SUGGESTED SOLUTIONS]
We have seen some of the basic things we can do with Pandas. In doing so, we created some simple DataFrames from dicts. That was simple, but it is almost never how we create DataFrames in the wild.
Most data live in files, often as plain text, comma-separated values (CSV), or MS Excel workbooks, either on our computers or in the cloud. In this notebook, we will learn how to get data into (and out of) Pandas.
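As a quick reminder, here is the dict approach from last time (a minimal sketch; the numbers are a few GDP figures we will meet again below):
import pandas as pd
gdp_dict = {'year': [2015, 2016, 2017], 'gdp': [18219.297, 18707.189, 19485.394]}
df_from_dict = pd.DataFrame(gdp_dict)   # each dict key becomes a column
print(df_from_dict)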
Today, we will cover:
- Reading and writing CSV files
- Navigating our file structure
- Reading Excel workbooks
- Reading files directly from the web
Class Announcements
PS2 (due 9/18) is on eLC.
Files Needed
The files below are located on the course website and should be downloaded to your working directory prior to class.
File | Description
---|---
gdp_components.csv | GDP components
debt.xlsx | GDP, Federal Debt, 10-year Treasury rates
import pandas as pd # load the pandas package and call it pd
import os # The package name is already short enough. No need to rename it.
# The following is a bit of extra code to allow you to download the PISA scores from the internet at the end
# of the lecture. It's telling the computer to ignore ssl certificate errors.
import ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context
If you have not already, move the gdp_components.csv file to your local drive and put it in the same folder that holds this notebook. We expect this file to contain U.S. GDP and its major components. Let's see.
gdp = pd.read_csv('gdp_components.csv') # read_csv is a part of Pandas, so we need the pd.
print(type(gdp)) # What have we got here?
<class 'pandas.core.frame.DataFrame'>
This looks successful. read_csv() takes a string with the file name and creates a DataFrame. Let's take a look at the data.
print(gdp)
          DATE       GDPA     GPDIA      GCEA    EXPGSA    IMPGSA
0   1929-01-01    104.556    17.170     9.622     5.939     5.556
1   1930-01-01     92.160    11.428    10.273     4.444     4.121
2   1931-01-01     77.391     6.549    10.169     2.906     2.905
3   1932-01-01     59.522     1.819     8.946     1.975     1.932
4   1933-01-01     57.154     2.276     8.875     1.987     1.929
..         ...        ...       ...       ...       ...       ...
84  2013-01-01  16784.851  2826.013  3132.409  2273.428  2764.210
85  2014-01-01  17521.747  3038.931  3167.041  2371.027  2879.284
86  2015-01-01  18219.297  3211.971  3234.210  2265.047  2786.461
87  2016-01-01  18707.189  3169.887  3290.979  2217.576  2738.146
88  2017-01-01  19485.394  3367.965  3374.444  2350.175  2928.596

[89 rows x 6 columns]
Even though Jupyter hid the middle rows, this is still a bit obnoxious. We can use the head() and tail() methods of DataFrame to peek at just the first or last few rows.
gdp.head(4) # Show the first 4 rows.
DATE | GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|---|
0 | 1929-01-01 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1 | 1930-01-01 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
2 | 1931-01-01 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
3 | 1932-01-01 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
If you do not pass head() or tail() an argument, it defaults to 5 rows.
gdp.tail()
DATE | GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|---|
84 | 2013-01-01 | 16784.851 | 2826.013 | 3132.409 | 2273.428 | 2764.210 |
85 | 2014-01-01 | 17521.747 | 3038.931 | 3167.041 | 2371.027 | 2879.284 |
86 | 2015-01-01 | 18219.297 | 3211.971 | 3234.210 | 2265.047 | 2786.461 |
87 | 2016-01-01 | 18707.189 | 3169.887 | 3290.979 | 2217.576 | 2738.146 |
88 | 2017-01-01 | 19485.394 | 3367.965 | 3374.444 | 2350.175 | 2928.596 |
You can also ask for a random sample of the rows using .sample(), where the input argument is, you guessed it, the number of rows. Without an argument, sample() returns just a single row, so pass it something (e.g., 5).
gdp.sample(5)
DATE | GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|---|
39 | 1968-01-01 | 940.651 | 156.923 | 225.526 | 47.906 | 46.556 |
43 | 1972-01-01 | 1279.110 | 228.140 | 286.189 | 70.843 | 74.216 |
80 | 2009-01-01 | 14448.932 | 1929.664 | 3073.512 | 1581.996 | 1978.447 |
46 | 1975-01-01 | 1684.904 | 257.253 | 381.128 | 138.706 | 122.730 |
7 | 1936-01-01 | 84.830 | 9.391 | 13.398 | 3.007 | 3.154 |
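Since sample() draws rows at random, your five rows will differ from mine. If you want a reproducible draw, the random_state argument fixes the seed:
gdp.sample(5, random_state=42)   # the same five rows on every run (42 is an arbitrary seed)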
The index isn't very sensible. This is time series data (the unit of observation is a year), so the date seems like a good index. How do we set the index?
gdp_new_index = gdp.set_index('DATE') # We could use 'inplace = True' if we didn't need a copy.
print(gdp_new_index.head())
               GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929
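Two quick asides (both standard pandas behavior): passing inplace=True modifies a DataFrame directly instead of returning a copy, and the parse_dates argument of read_csv() converts the date strings into proper datetime values as the file is read.
gdp_inplace = gdp.copy()                      # work on a copy so gdp itself is unchanged
gdp_inplace.set_index('DATE', inplace=True)   # no assignment needed; modifies gdp_inplace directly

# parse_dates=True asks read_csv to parse the index column as datetimes
gdp_dates = pd.read_csv('gdp_components.csv', index_col=0, parse_dates=True)
print(gdp_dates.index.dtype)                  # datetime64[ns] rather than object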
We can also set the index as we read in the file. Let's take a look at the read_csv() function.
pd.read_csv?
I'm seeing a lot of good stuff here: index_col, usecols, header, sep... some stuff I don't know about, too. When reading in messy files, these extra arguments may come in handy.
Let's give index_col a try.
gdp_2 = pd.read_csv('gdp_components.csv', index_col=0) # Treat the CSV like a DataFrame. Count columns starting with 0.
gdp_2.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929-01-01 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930-01-01 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931-01-01 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932-01-01 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933-01-01 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
Navigating your file structure
We dumped our file into our current working directory so we could just ask for the file name gdp_components.csv in read_csv(). What is our current working directory (cwd)?
path_to_cwd = os.getcwd() # getcwd() is part of the os package we imported earlier
print(path_to_cwd)
C:\Users\jt83241\Dropbox\Teaching\Undergrad\Data Science for Business and Economics\Lectures
When we gave read_csv() gdp_components.csv, it looked in our cwd for the file. Let's try something more complicated. Make a copy of the gdp_components file and paste it into your Data folder. Rename the file gdp_components_moved.csv.
gdp_moved = pd.read_csv('gdp_components_moved.csv')
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
    ...(long traceback through pandas' read_csv internals omitted)...
FileNotFoundError: [Errno 2] No such file or directory: 'gdp_components_moved.csv'
Of course this doesn't work. The file is not in our cwd. It's good to see what that kind of error message looks like. We need to pass read_csv() the path to the file. The path is the hierarchy of folders that contains the file. In my case, the path is
C:\Users\jt83241\Dropbox\Teaching\Undergrad\Data Science for Business and Economics\Lectures
Note that there is a / each time we list a new folder. We can also use ./ where the dot means the cwd. Ok, I lied a little: Windows uses backslashes while the rest of the world uses forward slashes. The reasons are mostly historical.
gdp_moved = pd.read_csv('./Data/gdp_components_moved.csv')
gdp_moved.head()
We could have manipulated some strings to get to this, too. This approach might be useful if you needed to read in many files from the same place, maybe using a for loop and a list of file names. (There's a sketch of that below.)
path_to_cwd = os.getcwd()
file_name = 'gdp_components_moved.csv'
path_to_data_file = path_to_cwd + '/Data/' + file_name
print(path_to_data_file)
C:\Users\jt83241\Dropbox\Teaching\Undergrad\Data Science for Business and Economics\Lectures/Data/gdp_components_moved.csv
gdp_moved = pd.read_csv(path_to_data_file, index_col=0)
gdp_moved.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929-01-01 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930-01-01 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931-01-01 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932-01-01 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933-01-01 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
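Here is a sketch of the for-loop idea, assuming a list of file names that all live in the Data folder (os.path.join inserts the slashes for us, and pd.concat stacks the resulting DataFrames):
file_names = ['gdp_components_moved.csv']   # imagine a longer list of files here
frames = []
for f in file_names:
    frames.append(pd.read_csv(os.path.join(os.getcwd(), 'Data', f), index_col=0))
combined = pd.concat(frames)                # stack the DataFrames into one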
Practice: Reading (and Writing) CSVs
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- Try out the to_csv() method of DataFrame. Save gdp_moved as 'gdp_moved_2.csv' in your cwd. [You can use ? if you need help.]
gdp_moved.to_csv('gdp_moved_2.csv')
- Use to_csv() again to save gdp_moved to the Data folder. Name it 'gdp_moved_3.csv'.
gdp_moved.to_csv('./Data/gdp_moved_3.csv')
Are your files in the correct places?
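One way to check without leaving the notebook, using the os package we imported earlier:
print(os.path.exists('gdp_moved_2.csv'))          # True if the file landed in the cwd
print(os.path.exists('./Data/gdp_moved_3.csv'))   # True if it landed in the Data folder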
Isn't this supposed to be practice reading in CSV files? Right. Let's do some of that.
- Use gdp_moved_3.csv to create a DataFrame named gdp_growth. Set the index to the dates. Print out the first 10 years of data.
gdp_growth = pd.read_csv('./Data/gdp_moved_3.csv', index_col=0)
gdp_growth.head(10)
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929-01-01 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930-01-01 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931-01-01 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932-01-01 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933-01-01 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
1934-01-01 | 66.800 | 4.296 | 10.721 | 2.561 | 2.239 |
1935-01-01 | 74.241 | 7.370 | 11.151 | 2.769 | 2.982 |
1936-01-01 | 84.830 | 9.391 | 13.398 | 3.007 | 3.154 |
1937-01-01 | 93.003 | 12.967 | 13.119 | 4.039 | 3.961 |
1938-01-01 | 87.352 | 7.944 | 14.170 | 3.811 | 2.845 |
- Rename 'GDPA' to 'gdp' and rename 'GCEA' to 'gov'
gdp_growth.rename(columns={'GDPA':'gdp', 'GCEA':'gov'}, inplace=True)
print(gdp_growth.head())
                gdp   GPDIA     gov  EXPGSA  IMPGSA
DATE
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929
Reading Excel spreadsheets
Reading spreadsheets isn't much different from reading csv files. But since workbooks are more complicated than csv files, we have a few more options to consider.
If you haven't already, copy over debt.xlsx to your cwd. Let's open it in Excel and have a look at it...
There's a lot going on here: missing data, some #N/A stuff, and several header rows. Let's get to work.
debt = pd.read_excel('./Data/debt.xlsx')
debt.head(10)
FRED Graph Observations | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | |
---|---|---|---|---|
0 | Federal Reserve Economic Data | NaN | NaN | NaN |
1 | Link: https://fred.stlouisfed.org | NaN | NaN | NaN |
2 | Help: https://fred.stlouisfed.org/help-faq | NaN | NaN | NaN |
3 | Economic Research Division | NaN | NaN | NaN |
4 | Federal Reserve Bank of St. Louis | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN |
6 | GDPA | Gross Domestic Product, Billions of Dollars, A... | NaN | NaN |
7 | GFDEBTN | Federal Debt: Total Public Debt, Millions of D... | NaN | NaN |
8 | DGS10 | 10-Year Treasury Constant Maturity Rate, Perce... | NaN | NaN |
9 | NaN | NaN | NaN | NaN |
This type of thing is probably pretty common, so my guess is someone wrote pandas code to load the data more efficiently. Let's learn more about pd.read_excel to see if we can figure out a better way to load the data.
pd.read_excel?
Let's use the header argument to specify the row to use as the column names and -- as usual -- the notation is zero-based.
debt = pd.read_excel('./Data/debt.xlsx', header = 12)
print(debt.head())
print('\n')
print(debt.tail())
  observation_date     GDPA  GFDEBTN  DGS10
0       1929-01-01  104.556      NaN    NaN
1       1930-01-01   92.160      NaN    NaN
2       1931-01-01   77.391      NaN    NaN
3       1932-01-01   59.522      NaN    NaN
4       1933-01-01   57.154      NaN    NaN


   observation_date       GDPA      GFDEBTN     DGS10
85       2014-01-01  17521.747  17799837.00  2.539560
86       2015-01-01  18219.297  18344212.75  2.138287
87       2016-01-01  18707.189  19549200.50  1.837440
88       2017-01-01  19485.394  20107155.25  2.329480
89       2018-01-01        NaN          NaN       NaN
That's looking good. Notice that Pandas added NaN for the missing data and for those #N/A entries. We will have to deal with those at some point. The header parameter is part of read_csv(), too.
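An aside: if a file used markers that Pandas did not recognize automatically, the na_values argument (available in both read_excel() and read_csv()) lists extra strings to treat as missing. A quick sketch; the markers here are just illustrative:
debt_custom = pd.read_excel('./Data/debt.xlsx', header=12, na_values=['#N/A', 'missing'])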
We didn't specify which sheet in the workbook to load, so Pandas took the first one. The documentation says we can ask for sheets by name.
debt_q = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name='quarterly')
print(debt_q.head())
print('\n')
print(debt_q.tail())
  observation_date  GFDEBTN  DGS10      GDP
0       1947-01-01      NaN    NaN  243.164
1       1947-04-01      NaN    NaN  245.968
2       1947-07-01      NaN    NaN  249.585
3       1947-10-01      NaN    NaN  259.745
4       1948-01-01      NaN    NaN  265.742


    observation_date     GFDEBTN     DGS10        GDP
281       2017-04-01  19844554.0  2.260952  19359.123
282       2017-07-01  20244900.0  2.241429  19588.074
283       2017-10-01  20492747.0  2.371452  19831.829
284       2018-01-01  21089643.0  2.758525  20041.047
285       2018-04-01  21195070.0  2.920625  20411.924
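A related trick: passing sheet_name=None loads every sheet at once, returning a dict that maps sheet names to DataFrames.
all_sheets = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name=None)
print(all_sheets.keys())               # one key per sheet in the workbook
debt_q2 = all_sheets['quarterly']      # grab a single sheet from the dict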
The documentation also tells us that we can ask for just a subset of the columns when reading in a file (csv or xlsx) using the usecols argument. This takes either integers or Excel column letters.
# Take the first and third columns of sheet 'quarterly'
interest_rates = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,2])
interest_rates.head()
observation_date | DGS10 | |
---|---|---|
0 | 1947-01-01 | NaN |
1 | 1947-04-01 | NaN |
2 | 1947-07-01 | NaN |
3 | 1947-10-01 | NaN |
4 | 1948-01-01 | NaN |
Practice: Reading Excel
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- Read in the quarterly data from 'debt.xlsx' and keep only the columns with the date, GDP, and GFDEBTN. Name your new DataFrame fed_debt.
fed_debt = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,1,3])
fed_debt.head()
observation_date | GFDEBTN | GDP | |
---|---|---|---|
0 | 1947-01-01 | NaN | 243.164 |
1 | 1947-04-01 | NaN | 245.968 |
2 | 1947-07-01 | NaN | 249.585 |
3 | 1947-10-01 | NaN | 259.745 |
4 | 1948-01-01 | NaN | 265.742 |
- Oops, I wanted to set the observation_date to the index. Go back and add that to your solution to 1.
fed_debt = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,1,3], index_col=0)
fed_debt.head()
GFDEBTN | GDP | |
---|---|---|
observation_date | ||
1947-01-01 | NaN | 243.164 |
1947-04-01 | NaN | 245.968 |
1947-07-01 | NaN | 249.585 |
1947-10-01 | NaN | 259.745 |
1948-01-01 | NaN | 265.742 |
- What is 'GFDEBTN'? It is the federal debt, in millions. Rename this variable to 'DEBT'
fed_debt.rename(columns={'GFDEBTN':'DEBT'}, inplace=True)
fed_debt.head()
DEBT | GDP | |
---|---|---|
observation_date | ||
1947-01-01 | NaN | 243.164 |
1947-04-01 | NaN | 245.968 |
1947-07-01 | NaN | 249.585 |
1947-10-01 | NaN | 259.745 |
1948-01-01 | NaN | 265.742 |
- Create a variable named debt_ratio that is the debt-to-GDP ratio. Debt is in millions and GDP is in billions. Adjust accordingly. Print the last eight rows.
fed_debt['debt_ratio'] = (fed_debt['DEBT']/1000)/fed_debt['GDP']   # debt is in millions, GDP in billions
fed_debt.sample(8)   # a random peek at 8 rows; fed_debt.tail(8) would give literally the last eight
DEBT | GDP | debt_ratio | |
---|---|---|---|
observation_date | |||
1970-01-01 | 372007.0 | 1051.200 | 0.353888 |
1995-04-01 | 4951372.0 | 7580.997 | 0.653129 |
1966-07-01 | 324748.0 | 819.638 | 0.396209 |
2001-10-01 | 5943439.0 | 10660.294 | 0.557530 |
2007-01-01 | 8849665.0 | 14208.569 | 0.622840 |
1966-10-01 | 329319.0 | 833.302 | 0.395198 |
1952-10-01 | NaN | 380.812 | NaN |
1981-01-01 | 964531.0 | 3124.206 | 0.308728 |
There are a lot of missing debt values. Did Pandas throw an error? No. Pandas knows (in some cases) how to work around missing data.
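For instance, most summary methods quietly skip the NaNs; they take a skipna argument that defaults to True.
print(fed_debt['debt_ratio'].mean())               # NaNs are ignored in the average
print(fed_debt['debt_ratio'].mean(skipna=False))   # force NaNs to count: the result is NaN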
- Summarize the debt_ratio variable using the .describe() method. What is its max level? Its min?
print(fed_debt['debt_ratio'].describe())
count    210.000000
mean       0.564994
std        0.227520
min        0.306033
25%        0.355102
50%        0.555767
75%        0.641648
max        1.052562
Name: debt_ratio, dtype: float64
Reading data from the web
Both read_csv() and read_excel() also accept a URL in place of a file path, so we can pull files straight from the internet.
# Read in the Penn World Table data
url = "http://www.rug.nl/ggdc/docs/pwt81.xlsx"
pwt = pd.read_excel(url, sheet_name= "Data")
pwt.head()
countrycode | country | currency_unit | year | rgdpe | rgdpo | pop | emp | avh | hc | ... | csh_g | csh_x | csh_m | csh_r | pl_c | pl_i | pl_g | pl_x | pl_m | pl_k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AGO | Angola | Kwanza | 1950 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | AGO | Angola | Kwanza | 1951 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | AGO | Angola | Kwanza | 1952 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | AGO | Angola | Kwanza | 1953 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | AGO | Angola | Kwanza | 1954 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 47 columns
That took a few seconds --- this is a pretty big file.
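Since the download is slow, one option (a sketch, with a cache file name of my choosing) is to save a local copy the first time and read from disk on later runs:
pwt.to_csv('./Data/pwt81_cached.csv')                       # save once...
pwt = pd.read_csv('./Data/pwt81_cached.csv', index_col=0)   # ...and reload locally next time
Let's try an example from McKinney's book: baby name counts for a year. I wonder what was trendy in 1880.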
baby_url = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames'
# What was trendy in 1880?
old_names = pd.read_csv(baby_url + '/yob1880.txt')
old_names.head()
Mary | F | 7065 | |
---|---|---|---|
0 | Anna | F | 2604 |
1 | Emma | F | 2003 |
2 | Elizabeth | F | 1939 |
3 | Minnie | F | 1746 |
4 | Margaret | F | 1578 |
We've lost Mary, which looks pretty popular. It was used as a column label. Not good. Let's fix this by specifying header=None to tell the computer not to import a header (i.e., don't import column names).
old_names = pd.read_csv(baby_url + '/yob1880.txt', header=None)
old_names.head()
0 | 1 | 2 | |
---|---|---|---|
0 | Mary | F | 7065 |
1 | Anna | F | 2604 |
2 | Emma | F | 2003 |
3 | Elizabeth | F | 1939 |
4 | Minnie | F | 1746 |
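An alternative (also in the read_csv signature) is the names argument, which supplies the column labels up front:
old_names_named = pd.read_csv(baby_url + '/yob1880.txt', header=None,
                              names=['name', 'sex', 'count'])   # labels of our choosing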
old_names = pd.read_csv(baby_url + '/yob2009.txt', header=None)   # now the 2009 file
old_names.head()
0 | 1 | 2 | |
---|---|---|---|
0 | Isabella | F | 22222 |
1 | Emma | F | 17830 |
2 | Olivia | F | 17374 |
3 | Sophia | F | 16869 |
4 | Ava | F | 15826 |
old_names.rename(columns={0:'name', 1:'sex', 2:'count'}, inplace=True)
old_names.head()
name | sex | count | |
---|---|---|---|
0 | Isabella | F | 22222 |
1 | Emma | F | 17830 |
2 | Olivia | F | 17374 |
3 | Sophia | F | 16869 |
4 | Ava | F | 15826 |
- What are the two most popular female names in 2009? Try the .sort_values() method but be sure to learn about it first...
# 1. Isolate female names
old_names_female=old_names[old_names['sex']=='F']
# 2. Sort in descending order
old_names_female_sorted = old_names_female.sort_values(by=['count'], ascending=False)
old_names_female_sorted.head(2)
name | sex | count | |
---|---|---|---|
0 | Isabella | F | 22222 |
1 | Emma | F | 17830 |
- What are the two least popular female names in 2009? Similar drill. If you got the answer to the previous question, this will be easy.
old_names_female_sorted.tail(2)
name | sex | count | |
---|---|---|---|
18193 | Gilda | F | 5 |
20122 | Zyriel | F | 5 |
PISA Scores
Run the following code to create a DataFrame with mean scores in math, reading, and science.
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
usecols=[0,1,9,13]) # select columns of interest
Note that we're not setting the index yet. There is some junk at the top and bottom of the spreadsheet.
- Use the skiprows and skipfooter arguments to find the good data.
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
skiprows=18, # skip the first 18 rows
skipfooter=7, # skip the last 7
usecols=[0,1,9,13], # select columns of interest
#index_col=0, # set the index as the first column
#header=[0,1] # set the variable names
)
pisa
Unnamed: 0 | Mathematics | Reading | Science | |
---|---|---|---|---|
0 | NaN | Mean score in PISA 2012 | Mean score in PISA 2012 | Mean score in PISA 2012 |
1 | NaN | NaN | NaN | NaN |
2 | OECD average | 494.046447 | 496.462864 | 501.159793 |
3 | NaN | NaN | NaN | NaN |
4 | Shanghai-China | 612.675536 | 569.588408 | 580.117831 |
... | ... | ... | ... | ... |
64 | Jordan | 385.595556 | 399.034747 | 409.367215 |
65 | Colombia | 376.488601 | 403.402534 | 398.678632 |
66 | Qatar | 376.448399 | 387.504131 | 383.642553 |
67 | Indonesia | 375.114452 | 396.120095 | 381.911485 |
68 | Peru | 368.102547 | 384.151223 | 373.113448 |
69 rows × 4 columns
- There are a lot of missing values. Let's clean up the data. Use the .dropna() method to get rid of rows with NaN. As always, be sure to learn about the function so you can use it properly.
pisa2 = pisa.dropna()
pisa2
Unnamed: 0 | Mathematics | Reading | Science | |
---|---|---|---|---|
2 | OECD average | 494.046447 | 496.462864 | 501.159793 |
4 | Shanghai-China | 612.675536 | 569.588408 | 580.117831 |
5 | Singapore | 573.468314 | 542.215834 | 551.493157 |
6 | Hong Kong-China | 561.241096 | 544.600086 | 554.937434 |
7 | Chinese Taipei | 559.824796 | 523.118904 | 523.314904 |
... | ... | ... | ... | ... |
64 | Jordan | 385.595556 | 399.034747 | 409.367215 |
65 | Colombia | 376.488601 | 403.402534 | 398.678632 |
66 | Qatar | 376.448399 | 387.504131 | 383.642553 |
67 | Indonesia | 375.114452 | 396.120095 | 381.911485 |
68 | Peru | 368.102547 | 384.151223 | 373.113448 |
66 rows × 4 columns
- Make the country names the index.
pisa2.set_index('Unnamed: 0', inplace=True)
pisa2
Mathematics | Reading | Science | |
---|---|---|---|
Unnamed: 0 | |||
OECD average | 494.046447 | 496.462864 | 501.159793 |
Shanghai-China | 612.675536 | 569.588408 | 580.117831 |
Singapore | 573.468314 | 542.215834 | 551.493157 |
Hong Kong-China | 561.241096 | 544.600086 | 554.937434 |
Chinese Taipei | 559.824796 | 523.118904 | 523.314904 |
... | ... | ... | ... |
Jordan | 385.595556 | 399.034747 | 409.367215 |
Colombia | 376.488601 | 403.402534 | 398.678632 |
Qatar | 376.448399 | 387.504131 | 383.642553 |
Indonesia | 375.114452 | 396.120095 | 381.911485 |
Peru | 368.102547 | 384.151223 | 373.113448 |
66 rows × 3 columns
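That leftover 'Unnamed: 0' label on the index is ugly. One way to tidy it (the name 'country' is my choice):
pisa2.index.name = 'country'   # relabel the index in place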
- What percent is the US PISA score compared to the "OECD average"?
print(pisa2.loc['United States']/pisa2.loc['OECD average'])
Mathematics    0.974335
Reading        1.002254
Science        0.992517
dtype: object
- Challenging. How correlated are PISA math, reading, and science scores with each other? Write the correlation matrix to a file called 'pisa_corr.csv' using .to_csv(). This is a challenging question because, depending on how you read in the data, your columns are probably of type 'object' and corr() won't work. Convert the three columns to numbers and then find the correlation matrix using the .corr() method.
pisa2.columns = ['math', 'read', 'sci'] # changing the column names (not necessary)
# Brute force
for var in pisa2.columns:
    pisa2[var] = pisa2[var].astype('float')
pisa2.corr()
C:\Users\jt83241\AppData\Local\Temp\ipykernel_20352\3479768300.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pisa2[var] = pisa2[var].astype('float')
math | read | sci | |
---|---|---|---|
math | 1.000000 | 0.959806 | 0.972131 |
read | 0.959806 | 1.000000 | 0.978559 |
sci | 0.972131 | 0.978559 | 1.000000 |
# Fancier approach. The "apply" method applies the function "pd.to_numeric" to each column
# in the list ['math', 'read', 'sci']. We can "apply" any function, even ones we've created.
pisa2=pisa2[['math', 'read', 'sci']].apply(pd.to_numeric)
pisa2.corr()
math | read | sci | |
---|---|---|---|
math | 1.000000 | 0.959806 | 0.972131 |
read | 0.959806 | 1.000000 | 0.978559 |
sci | 0.972131 | 0.978559 | 1.000000 |
# Save the correlation matrix (not the data itself) as csv
pisa2.corr().to_csv('./pisa_corr.csv')