# Initialize Otter
import otter
grader = otter.Notebook("lab1-pandas.ipynb")
Lab 1: Pandas Overview
Pandas is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data tidying operations/tools in Pandas.
Objectives
This lab covers the following topics:
- Dataframe basics
- Creating dataframes
- Dataframe indexing and attributes
- Adding, removing, and renaming variables
- Operations on dataframes
- Slicing (selecting rows and columns)
- Filtering (selecting rows that meet certain conditions)
- Grouping and aggregation
- Summary statistics (mean, median, variance, etc.)
- Grouped summaries
- Chaining operations and style guidelines
- Pivoting
Note: The Pandas interface is notoriously confusing, and the documentation is not consistently great. Be prepared to search through Pandas documentation and experiment, but remember it is part of the learning experience and will help shape you as a data scientist!
import numpy as np
import altair as alt
import pandas as pd
Creating DataFrames & Basic Manipulations
A dataframe is a table in which each column has a type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers). An index is represented by a series object, which is a one-dimensional labeled array. Here you’ll cover:
- creating dataframes from scratch;
- retrieving attributes;
- dataframe indexing;
- adding, removing, and renaming columns.
Creating dataframes from scratch
The documentation for the pandas DataFrame class provides two primary syntaxes to create a data frame from scratch:
- from a dictionary
- row-wise tuples
Syntax 1 (dictionary): You can create a data frame by specifying the columns and values using a dictionary (a mapping of column names to lists of values) as shown below.
The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.
# define a dataframe using dictionary syntax
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
            'color': ['red', 'orange', 'yellow', 'pink']}
)
# print
fruit_info
Syntax 2 (row tuples): You can also define a dataframe by specifying the rows as tuples.
Each row corresponds to a distinct tuple, and the column indices are specified separately.
# define the same dataframe using tuple syntax
fruit_info2 = pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"]
)
# print
fruit_info2
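Both syntaxes should yield exactly the same table. As a quick sanity check (a self-contained sketch, reconstructing both frames so it runs on its own; not one of the lab's graded cells):

```python
import pandas as pd

# syntax 1: dictionary of columns
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
            'color': ['red', 'orange', 'yellow', 'pink']}
)

# syntax 2: list of row tuples, column labels supplied separately
fruit_info2 = pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"]
)

# the two constructions agree entry for entry
print(fruit_info.equals(fruit_info2))  # True
```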
Dataframe Attributes
DataFrames have several basic attributes:
- .shape contains the dimensions (rows, columns);
- .dtypes contains the data types (float, integer, object, etc.);
- .size contains the total number of entries;
- .values contains an array comprising each entry in the dataframe;
- .columns contains the column index;
- .index contains the row index.
You can obtain these attributes by appending the attribute name to the dataframe name. For instance, the dimensions of a dataframe df can be retrieved by df.shape.
# dimensions
fruit_info.shape
To retrieve a two-dimensional numpy array with the values of the dataframe, use df.values. It is sometimes useful to extract this data structure in order to perform vectorized operations, linear algebra, and the like.
# as array
fruit_info.values
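Since .values hands back a NumPy array, NumPy operations apply directly. A small sketch with a hypothetical numeric frame (not part of the lab data):

```python
import numpy as np
import pandas as pd

# a small numeric frame (hypothetical example data)
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})

arr = df.values           # 2-D numpy array
print(arr.shape)          # (3, 2)
print(arr.mean(axis = 0)) # column means: [2. 5.]
```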
Dataframe Indexing
The entries in a dataframe are indexed. Indices for rows and columns are stored as the .index and .columns attributes, respectively.
fruit_info.columns
fruit_info.index
By default, the row indexing is simply numbering by consecutive integers.
fruit_info.index.values
However, rows can alternatively be indexed by labels:
# define with a row index
pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"],
    index = ["fruit 1", "fruit 2", "fruit 3", "fruit 4"]
)
Unlike data frames in R, the row index label figures prominently in certain operations. The elements of the dataframe can be retrieved using .loc[ROW-INDEX, COL-INDEX], which specifies the location of data values by name (not by position).
# retrieve row 0, column 'fruit'
fruit_info.loc[0, 'fruit']
Most of the time rows are indexed numerically, and, somewhat confusingly, the syntax for .loc does not require putting the row index 0 in quotes, even though it refers to the row label and not the row number. This is important to remember, because operations will often scramble the order of rows. To see the difference, consider the following:
# non-consecutive row index
pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"],
    index = np.array([8, 6, 4, 2])
)
Now applying .loc[2, 'color'] will retrieve pink, the last row of the dataframe.
# subset
pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"],
    index = np.array([8, 6, 4, 2])
).loc[2, 'color']
To retrieve values by position, use .iloc. For many, this is more intuitive, as it is most similar to matrix or array indexing in mathematical notation.
# retrieve 0, 0 entry
fruit_info.iloc[0, 0]
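To see the label/position distinction in one place, here is the non-consecutive index example again with both accessors side by side (a self-contained sketch using the same values as the cells above):

```python
import numpy as np
import pandas as pd

# same fruit table, with a non-consecutive row index
fruits = pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"],
    index = np.array([8, 6, 4, 2])
)

print(fruits.loc[2, 'color'])  # 'pink'   -- label 2 is the last row
print(fruits.iloc[2, 1])       # 'yellow' -- position 2 is the third row
```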
Adding, removing, and renaming columns
There are two ways to add new columns:
- direct specification;
- using .loc[].
Direct specification: For a dataframe df, you can add a column with df['new column name'] = ... and assign a list or array of values to the column.
Using .loc[]: For a dataframe df, you can add a column with df.loc[:, 'new column name'] = ... and assign a list or array of values to the column.
Both accomplish the same task – adding a new column index and populating values for each row – but .loc[] is a little faster.
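Both approaches look like this in practice (a sketch on a hypothetical toy frame, kept separate from fruit_info so it doesn't give away the next question):

```python
import pandas as pd

# toy frame (hypothetical)
df = pd.DataFrame({'a': [1, 2, 3]})

# direct specification
df['b'] = [10, 20, 30]

# using .loc[] -- the ':' selects all rows
df.loc[:, 'c'] = [100, 200, 300]

print(df.columns.tolist())  # ['a', 'b', 'c']
```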
Question 1
Using direct specification, add to the fruit_info
table a new column called rank1
containing integers 1, 2, 3, and 4, which express your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty). Make sure that the numbers utilized are unique - no ties are allowed.
...
# print
fruit_info
grader.check("q1")
Now, create a new dataframe fruit_info_mod1 with the same information as fruit_info, but with the additional column rank2. Let’s start off by making fruit_info_mod1 a copy of fruit_info:
fruit_info_mod1 = fruit_info.copy()
Question 2
Using .loc[], add a column called rank2 to the fruit_info_mod1 table that contains the same values in the same order as the rank1 column.
Hint: .loc will parse : as shorthand for ‘all indices’.
...
# print
fruit_info_mod1
grader.check("q2")
When using the .loc[] approach, the : specifies that values are assigned to all rows of the data frame, so the array assigned to the new variable must be the same length as the data frame. What if we only assign values to certain rows? Try running the cell below.
# define new variable just for rows 1 and 2
fruit_info_mod1.loc[1:2, 'rank3'] = [1, 2]
# check result
fruit_info_mod1
The remaining rows are assigned missing values. Notice what this does to the data type:
# check data types
fruit_info_mod1.dtypes
We can detect these missing values using .isna():
# returns a logical data frame indicating whether each entry is missing or not
fruit_info_mod1.isna()
It would be more helpful to simply see by column whether there are missing values. Appending a .any() to the above command will do the trick:
# detects whether any column has missing entries
fruit_info_mod1.isna().any()
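Chaining .sum() instead of .any() counts the missing entries per column rather than just flagging them. A self-contained sketch on a toy frame (hypothetical data) with one missing entry:

```python
import numpy as np
import pandas as pd

# toy frame with one missing entry (hypothetical)
df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': ['a', 'b', 'c']})

print(df.isna().any())  # x: True, y: False
print(df.isna().sum())  # x: 1,    y: 0
```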
Now that we’ve had a bit of fun, let’s remove those rank variables. Columns can be removed using .drop() with a list of column names to drop as its argument. For example:
# first syntax for .drop()
fruit_info_mod1.drop(columns = 'color')
There is an alternate syntax to that shown above, which involves specifying the axis (row vs. column) and index name to drop:
# second syntax for .drop()
fruit_info_mod1.drop('color', axis = 1)
Question 3
Use the .drop() method to drop all rank columns you created in fruit_info_mod1. Note that drop does not change the table, but instead returns a new table with fewer columns or rows. To store the result, assign a new name (or write over the old dataframe). Here, assign the result to fruit_info_original.
Hint: Look through the documentation to see how you can drop multiple columns of a Pandas dataframe at once using a list of column names.
fruit_info_original = ...
# print
fruit_info_original
grader.check("q3")
Nifty trick: Use df.columns[df.columns.str.startswith('STRING')] to retrieve all column indices starting with STRING, and .values.tolist() to convert an index to a list of index names. Combining these, df.columns[df.columns.str.startswith('STRING')].values.tolist() returns a list of all column names starting with STRING. This can be used in conjunction with the hint to remove all columns starting with rank.
# try the nifty trick here
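As a sketch of how the pieces of the trick fit together (on a hypothetical toy frame, so it doesn't depend on your rank columns):

```python
import pandas as pd

# toy frame (hypothetical) with two rank columns
df = pd.DataFrame({'fruit': ['apple'], 'rank1': [1], 'rank2': [1]})

# build the list of column names starting with 'rank'
to_drop = df.columns[df.columns.str.startswith('rank')].values.tolist()
print(to_drop)  # ['rank1', 'rank2']

# pass the list to .drop() to remove all matching columns at once
print(df.drop(columns = to_drop).columns.tolist())  # ['fruit']
```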
Now create a new dataframe fruit_info_mod2 with the same information as fruit_info_original, but with the column names capitalized. Begin by creating a copy fruit_info_mod2 of fruit_info_original:
fruit_info_mod2 = fruit_info_original.copy()
Question 4
Review the documentation for .rename(). Based on the examples, rename the columns of fruit_info_mod2 so they begin with capital letters.
For many operations, you can change the dataframe ‘in place’ without reassigning the result of the operation to a new name by setting the inplace parameter to True. Use that strategy here.
...
# print
fruit_info_mod2
grader.check("q4")
Operations on Data Frames
With some basics in place, here you’ll see how to perform subsetting operations on data frames that are useful for tidying up datasets.
- Slicing: selecting columns or rows in chunks or by position.
    - Often imported data contain columns that are either superfluous or not of interest for a particular project.
    - You may also want to examine particular portions of a data frame.
- Filtering: selecting rows that meet certain conditions.
    - Often you’ll want to remove duplicate rows, filter missing observations, or select a structured subset of a data frame.
    - Also helpful for inspection.
To illustrate these operations, you’ll use a dataset comprising counts of the given names of babies born in California each year from 1990 - 2018. The cell below imports the baby names data as a data frame from a .csv file. .head() prints the first few rows of the dataset.
# import baby names data
baby_names = pd.read_csv('data/baby_names.csv')
# preview first few rows
baby_names.head()
Your focus here isn’t on analyzing this data, so we won’t ask you to spend too much effort getting acquainted with it. However, a brief inspection is always a good idea. Let’s check:
- dimensions (number of rows and columns);
- how many distinct states, sexes, and years.
Note that the above dataframe displayed is a preview of the full dataframe.
Question 5
You’ve already seen how to examine dimensions using dataframe attributes. Check the dimensions of baby_names and store them in dimensions_baby_names.
dimensions_baby_names = ...
grader.check("q5")
You haven’t yet seen how to retrieve the distinct values of an array or series without duplicates. There are a few different ways to go about this, but one is to count the number of occurrences of each distinct entry in a column. This can be done by retrieving the column as a series using syntax of the form df.colname, and then passing the result to .value_counts():
# count distinct values
baby_names.Sex.value_counts()
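If you only need the distinct values themselves, or just how many there are, .unique() and .nunique() are direct alternatives to counting occurrences. A sketch on a hypothetical toy series:

```python
import pandas as pd

# toy series (hypothetical)
s = pd.Series(['F', 'M', 'F', 'F', 'M'])

print(s.value_counts())  # F: 3, M: 2
print(s.unique())        # ['F' 'M']
print(s.nunique())       # 2
```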
Question 6
Count the number of occurrences of each distinct year. Create a series occur_per_year that displays the number of occurrences, ordered by year (so that the years are displayed in order). If you add sort = False as an argument to value_counts, the distinct values will be displayed in the order they appear in the dataset.
How many years are represented in the dataset? Store your answer as num_years.
occur_per_year = ...
num_years = ...
print(occur_per_year)
print(num_years)
grader.check("q6")
Slicing: selecting rows and columns
There are two fast and simple ways to slice dataframes:
- using .loc to specify rows and columns by index;
- using .iloc to specify rows and columns by position.
You have seen simple examples of both of these above. Here we’ll show how to use these two commands to retrieve multiple rows and columns.
Slicing with .loc: specifying index names
This method retrieves entries by specifying row and column indexes using syntax of the form df.loc[rows, cols]. The rows and columns can be single indices, a list of indices, or a set of adjacent indices using a colon :. Examples of these usages are shown below.
# single indices -- small slice
baby_names.loc[2, 'Name']
# a list of indices -- larger slice
baby_names.loc[[2, 3], ['Name', 'Count']]
# consecutive indices -- a chunk
baby_names.loc[2:10, 'Year':'Count']
Slicing with .iloc: specifying entry positions
An alternative to specifying the indices in order to slice a dataframe is to specify the entry positions using .iloc (‘integer location’). You have seen an example of this too. As with .loc, .iloc can be used to select multiple rows/columns using either lists of positions or a consecutive set with from:to syntax.
# single position
baby_names.iloc[2, 3]
# a list of positions
baby_names.iloc[[2, 3], [3, 4]]
# consecutive positions
baby_names.iloc[2:11, 2:5]
While these commands may look very similar to their .loc analogs, there are some subtle but important differences. The row selection looks nearly identical, but recall that .loc uses the index and .iloc uses the position; they look so similar because typically index and position coincide.
However, sorting the baby_names dataframe helps to reveal how the position of a row is not necessarily equal to the index of a row. For example, the first row is not necessarily the row associated with index 1. This distinction is important in understanding the difference between .loc[] and .iloc[].
# sort and display
sorted_baby_names = baby_names.sort_values(by = ['Name'])
sorted_baby_names.head()
Here is an example of how we would get the 2nd, 3rd, and 4th rows with only the Name column of the baby_names dataframe using both iloc[] and loc[]. Observe the difference, especially after sorting baby_names by name.
# example iloc usage
sorted_baby_names.iloc[1:4, 3]
Notice that using loc[] with 1:4 gives different results, since it selects using the index. The index gets moved around when you perform an operation like sort on the dataframe.
# same syntax, different result
sorted_baby_names.loc[1:4, "Name"]
Above, the .loc method retrieves all indexes between index 1 and index 4 in the order they appear in the sorted dataset. If instead we want to retrieve the same rows returned by the .iloc command, we need to specify the row indices explicitly as a list:
# retrieve the same rows as iloc using loc
sorted_baby_names.loc[[178791, 163914, 171112], 'Name']
Sometimes it’s useful for slicing (and other operations) to set one of the columns to be a row index, effectively treating one column as a collection of row labels. This can be accomplished using set_index.
# change the (row) index from 0,1,2,... to the name column
baby_names_nameindexed = baby_names.set_index("Name")
baby_names_nameindexed.head()
We can now slice by name directly:
# slice rows for ashley and jennifer
baby_names_nameindexed.loc[['Ashley', 'Jennifer'], :]
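As an aside, set_index can be undone with reset_index(), which moves the labels back into an ordinary column and restores the default integer index (a self-contained sketch on a hypothetical two-row frame):

```python
import pandas as pd

# toy frame (hypothetical)
df = pd.DataFrame({'Name': ['Ashley', 'Jennifer'], 'Count': [3, 5]})

indexed = df.set_index('Name')    # 'Name' becomes the row index
restored = indexed.reset_index()  # ...and moves back to an ordinary column

print(restored.equals(df))  # True
```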
Question 7
Look up the name of a friend! Store the name as friend_name. Use the name-indexed data frame to slice rows for the name of your choice and the Count, Sex, and Year columns in that order, and store the data frame as friend_slice.
# if your friend's name is not in the database, use another name
friend_name = ...
friend_slice = ...

# print
friend_slice
grader.check("q7")
Filtering
Filtering is sifting out rows according to a criterion, and can be accomplished using an array or series of Trues and Falses defined by a comparison. To take a simple example, say you wanted to filter out all names with fewer than 1000 occurrences. First you could define a logical series:
# true if filtering criterion is met, false otherwise
arr = baby_names.Count > 1000
Then you can filter using that array:
# filter
baby_names_filtered = baby_names[arr]
baby_names_filtered.head()
Notice that the filtered array is much smaller than the overall array – only about 2000 rows correspond to a name occurring more than 1000 times in a year for a gender.
# compare dimensions
print(baby_names_filtered.shape)
print(baby_names.shape)
You have already encountered this concept in lab 0 when subsetting an array. For your reference, some commonly used comparison operators are given below.
| Symbol | Usage | Meaning |
|---|---|---|
| == | a == b | Does a equal b? |
| <= | a <= b | Is a less than or equal to b? |
| >= | a >= b | Is a greater than or equal to b? |
| < | a < b | Is a less than b? |
| > | a > b | Is a greater than b? |
| ~ | ~p | Returns negation of p |
| \| | p \| q | p OR q |
| & | p & q | p AND q |
| ^ | p ^ q | p XOR q (exclusive or) |
What if instead you wanted to filter using multiple conditions? Here’s an example of retrieving rows with counts exceeding 1000 for only the year 2000:
# filter using two conditions
baby_names[(baby_names.Year == 2000) & (baby_names.Count > 1000)]
Question 8
Select the girl names in 2010 that were given more than 3000 times, and store them as common_girl_names_2010.
Note: Any time you use p & q to filter the dataframe, make sure to use df[(p) & (q)] or df.loc[(p) & (q)]. That is, make sure to wrap conditions with parentheses to ensure the intended order of operations.
common_girl_names_2010 = ...
common_girl_names_2010
grader.check("q8")
Grouping and aggregation
Grouping and aggregation are useful in generating data summaries, which are often important starting points in exploring a dataset.
Aggregation
Aggregation literally means ‘putting together’ (etymologically the word means ‘joining the herd’) – in statistics and data science, this refers to data summaries like an average, a minimum, or a measure of spread such as the sample variance or mean absolute deviation (data herding!). From a technical point of view, operations that take multiple values as inputs and return a single output are considered summaries – in other words, statistics. Some of the most common aggregations are:
- sum
- product
- count
- number of distinct values
- mean
- median
- variance
- standard deviation
- minimum/maximum
- quantiles
Pandas has built-in dataframe operations that compute most of these summaries across either axis (column-wise or row-wise):
- .sum()
- .prod()
- .mean()
- .median()
- .var()
- .std()
- .nunique()
- .min() and .max()
- .quantile()
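Beyond calling these methods one at a time, .agg() can compute several summaries in a single call (an aside not otherwise used in this lab; sketched on a hypothetical toy frame):

```python
import pandas as pd

# toy counts (hypothetical)
df = pd.DataFrame({'Count': [10, 20, 30, 40]})

# several summaries in one call
summary = df.Count.agg(['mean', 'median', 'min', 'max'])
print(summary)
```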
To illustrate these operations, let’s select all names recorded in 1995.
# filter 1995 names
names_95 = baby_names[baby_names.Year == 1995]
How many individuals were counted in total in 1995? We can address that by computing a sum of the counts:
# n for 1995
names_95.Count.sum()
What is the typical frequency of all names in 1995? We can address that by computing the average count:
# average count for a name in 1995
names_95.Count.mean()
Question 9
Find how often the most common name 1995 was given and store this as names_95_max_count
. Use this value to filter names_95
and find which name was most common that year. Store the filtered dataframe as names_95_most_common_name
.
names_95_max_count = ...
names_95_most_common_name = ...
print("Number of people with the most frequent name in 1995 is :", names_95_max_count, "people")
print("Most frequent name in 1995 is:", names_95_most_common_name.values[0])
grader.check("q9")
Caution! If applied to the entire dataframe, the operation df.max() (or any other aggregation) will return the maximum of each column. Notice that the cell below does not return the row you found just now, but could easily be misinterpreted as such. The cell does tell you that the maximum value of sex (alphabetically last) is M, the maximum name (alphabetically last) is Zyanya, and the maximum count is 5003; it does not tell you that 5003 boys were named Zyanya in 1995.
# maximum of each variable
names_95.max()
Grouping
What if you want to know the most frequent male and female names? If so, you’ll need to repeat the above operations group-wise by sex.
In general, any variable in a dataframe can be used to define a grouping structure on the rows (or, less commonly, columns). After grouping, any dataframe operations will be executed within each group, but not across groups. This can be used to generate grouped summaries, such as the maximum count for boys and girls; as a point of terminology, we’d describe this summary as ‘maximum count by sex’.
The .groupby() function defines such a structure; here is the documentation. The cell below groups the names_95 dataframe by sex. Notice that when the grouped dataframe is previewed with .head(), the first few rows are returned for each group.
# grouped dataframe
names_95_bysex = names_95.groupby('Sex')

# print
names_95_bysex.head(2)
Any aggregation operations applied to the grouped dataframe will be applied separately to the rows where Sex == M and the rows where Sex == F. For example, computing .sum() on the grouped dataframe will show the total number of individuals in the data for 1995 by sex:
# number of individuals by sex
names_95_bysex.Count.sum()
The most frequent boy and girl names can be found by applying .idxmax() groupwise to obtain the index of the first occurrence of the maximum count for each sex, and then slicing with .loc:
# first most common names by sex
names_95.loc[names_95_bysex.Count.idxmax(), :]
Since .idxmax() gives the index of the first occurrence, these are the alphabetically first most common names; there could be ties. You know from your work so far that there are no ties for the male names; another filtering step can be used to check for ties among the female names.
# ties?
names_95[names_95_bysex.Count.max().values[0] == names_95['Count']]
So, no ties.
Question 10
Are there more girl names or boy names in 1995? Use the grouped dataframe names_95_bysex with the .count() aggregation to find the total number of names for each sex. Store the female and male counts as girl_name_count and boy_name_count, respectively.
girl_name_count = ...
boy_name_count = ...

# print
print(girl_name_count)
print(boy_name_count)
grader.check("q10")
Chaining operations
You have already seen examples of this, but pandas and numpy operations can be chained together in sequence. For example, names_95.Count.max() is a chain with two steps: first select the Count column (.Count); then compute the maximum (.max()).
Grouped summaries are often convenient to compute in a chained fashion, rather than by assigning the grouped dataframe a new name and performing operations on the resulting object. For example, finding the total number of boys and girls recorded in the 1995 data can be done with the following chain:
# repeating previous calculation, more streamlined
names_95.groupby('Sex').Count.sum()
We can take this even one step further and also perform the filtering in sequence as part of the chain:
# longer chain
baby_names[baby_names.Year == 1995].groupby('Sex').Count.sum()
Chains can get somewhat long, but they have the advantage of making code more concise, and often more readable. We did above in one step what took several lines before. Further, this chain can almost be read aloud:
“Take baby names, filter on year, then group by sex, then select name counts, then compute the sum.”
Let’s now consider computing the average counts of boy and girl names for each year 1990-1995. This can be accomplished by the following chain (notice it is possible to group by multiple variables).
# average counts by sex and year
baby_names[baby_names.Year <= 1995].groupby(['Year', 'Sex']).mean(numeric_only = True)
This display is not ideal. We can ‘pivot’ the table into a wide format by adding a few extra steps in the chain: change the indices to columns; then define a new shape by specifying which column should be the new row index, which should be the new column index, and which values should populate the table.
# average counts by sex and year
baby_names[baby_names.Year <= 1995].groupby(
    ['Year', 'Sex']
).mean(
    numeric_only = True
).reset_index().pivot(
    index = 'Sex', columns = 'Year', values = 'Count'
)
Style comment: break long chains over multiple lines with indentation. The above chain is too long to be readable in one line. To balance the readability of code with the efficiency of chaining, it is good practice to break long chains over several lines, with appropriate indentation.
Here are some rules of thumb on style.
- Separate comparisons by spaces (a<b as a < b)
- Split chains longer than 30-40 characters over multiple lines
- Split lines between delimiters ( and )
- Increase indentation for lines between delimiters
- For chained operations, try to get each step in the chain shown on a separate line
- For functions with multiple arguments, split lines so that each argument is on its own line
Question 11
Write a chain with appropriate style to display the (first) most common boy and girl names in each of the years 2005-2015. Do this in two steps:
- First filter baby_names by year, then group by year and sex, and then find the indices of the first occurrence of the largest counts. Store these indices as ind.
- Then use .loc[] with your stored indices to slice baby_names so as to retrieve the rows corresponding to each most frequent name each year and for each sex; then pivot this table so that the columns are years, the rows are sexes, and the entries are names. Store this as pivot_names.
...

print(ind)
pivot_names
grader.check("q11")
Submission
- Save the notebook.
- Restart the kernel and run all cells. (CAUTION: if your notebook is not saved, you will lose your work.)
- Carefully look through your notebook and verify that all computations execute correctly. You should see no errors; if there are any errors, make sure to correct them before you submit the notebook.
- Download the notebook as an .ipynb file. This is your backup copy.
- Export the notebook as PDF and upload to Gradescope.
To double-check your work, the cell below will rerun all of the autograder tests.
grader.check_all()