# Initialize Otter
import otter
grader = otter.Notebook("lab1-pandas.ipynb")
Lab 1: Pandas Overview
Pandas is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data tidying operations/tools in Pandas.
Objectives
This lab covers the following topics:
- Dataframe basics
- Creating dataframes
- Dataframe indexing and attributes
- Adding, removing, and renaming variables
- Operations on dataframes
- Slicing (selecting rows and columns)
- Filtering (selecting rows that meet certain conditions)
- Grouping and aggregation
- Summary statistics (mean, median, variance, etc.)
- Grouped summaries
- Chaining operations and style guidelines
- Pivoting
Note: The Pandas interface is notoriously confusing, and the documentation is not consistently great. Be prepared to search through Pandas documentation and experiment, but remember it is part of the learning experience and will help shape you as a data scientist!
import numpy as np
import altair as alt
import pandas as pd
Creating DataFrames & Basic Manipulations
A dataframe is a table in which each column has a type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers). An index is represented by a series object, which is a one-dimensional labeled array. Here you’ll cover:
- creating dataframes from scratch;
- retrieving attributes;
- dataframe indexing;
- adding, removing, and renaming columns.
Creating dataframes from scratch
The documentation for the pandas DataFrame class provides two primary syntaxes to create a data frame from scratch:
- from a dictionary
- row-wise tuples
Syntax 1 (dictionary): You can create a data frame by specifying the columns and values using a dictionary (a mapping of column names to lists of values) as shown below.
The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.
# define a dataframe using dictionary syntax
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
            'color': ['red', 'orange', 'yellow', 'pink']}
)
# print
fruit_info
Syntax 2 (row tuples): You can also define a dataframe by specifying the rows as tuples.
Each row corresponds to a distinct tuple, and the column indices are specified separately.
# define the same dataframe using tuple syntax
fruit_info2 = pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"]
)
# print
fruit_info2
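Both syntaxes should yield exactly the same table. As a quick sanity check (a self-contained sketch, reconstructing both frames so it runs on its own; not one of the lab's graded cells):

```python
import pandas as pd

# syntax 1: dictionary of columns
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
            'color': ['red', 'orange', 'yellow', 'pink']}
)

# syntax 2: list of row tuples, column labels supplied separately
fruit_info2 = pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"]
)

# the two constructions agree entry for entry
print(fruit_info.equals(fruit_info2))  # True
```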
Dataframe Attributes
DataFrames have several basic attributes:
- .shape contains the dimensions (rows, columns);
- .dtypes contains the data types (float, integer, object, etc.);
- .size contains the total number of entries;
- .values contains an array comprising each entry in the dataframe;
- .columns contains the column index;
- .index contains the row index.
You can obtain these attributes by appending the attribute name to the dataframe name. For instance, the dimensions of a dataframe df can be retrieved by df.shape.
# dimensions
fruit_info.shape
To retrieve a two-dimensional numpy array with the values of the dataframe, use df.values. It is sometimes useful to extract this data structure in order to perform vectorized operations, linear algebra, and the like.
# as array
fruit_info.values
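Since .values hands back a NumPy array, NumPy operations apply directly. A small sketch with a hypothetical numeric frame (not part of the lab data):

```python
import numpy as np
import pandas as pd

# a small numeric frame (hypothetical example data)
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})

arr = df.values           # 2-D numpy array
print(arr.shape)          # (3, 2)
print(arr.mean(axis = 0)) # column means: [2. 5.]
```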
Dataframe Indexing
The entries in a dataframe are indexed. Indices for rows and columns are stored as the .index and .columns attributes, respectively.
fruit_info.columns
fruit_info.index
By default, the row indexing is simply numbering by consecutive integers.
fruit_info.index.values
However, rows can alternatively be indexed by labels:
# define with a row index
pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"],
    index = ["fruit 1", "fruit 2", "fruit 3", "fruit 4"]
)
Unlike data frames in R, the row index label figures prominently in certain operations. The elements of the dataframe can be retrieved using .loc[ROW-INDEX, COL-INDEX], which specifies the location of data values by name (not by position).
# retrieve row 0, column 'fruit'
fruit_info.loc[0, 'fruit']
Most of the time rows are indexed numerically, and, somewhat confusingly, the syntax for .loc does not require putting the row index 0 in quotes, even though it refers to the row label and not the row number. This is important to remember, because operations will often scramble the order of rows. To see the difference, consider the following:
# non-consecutive row index
pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"],
    index = np.array([8, 6, 4, 2])
)
Now applying .loc[2, 'color'] will retrieve pink, the last row of the dataframe.
# subset
pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"],
    index = np.array([8, 6, 4, 2])
).loc[2, 'color']
To retrieve values by position, use .iloc. For many, this is more intuitive, as it is most similar to matrix or array indexing in mathematical notation.
# retrieve 0, 0 entry
fruit_info.iloc[0, 0]
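To see the label/position distinction in one place, here is the non-consecutive index example again with both accessors side by side (a self-contained sketch using the same values as the cells above):

```python
import numpy as np
import pandas as pd

# same fruit table, with a non-consecutive row index
fruits = pd.DataFrame(
    [("apple", "red"), ("orange", "orange"), ("banana", "yellow"), ("raspberry", "pink")],
    columns = ["fruit", "color"],
    index = np.array([8, 6, 4, 2])
)

print(fruits.loc[2, 'color'])  # 'pink'   -- label 2 is the last row
print(fruits.iloc[2, 1])       # 'yellow' -- position 2 is the third row
```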
Adding, removing, and renaming columns
There are two ways to add new columns:
- direct specification;
- using .loc[].
Direct specification: For a dataframe df, you can add a column with df['new column name'] = ... and assign a list or array of values to the column.
Using .loc[]: For a dataframe df, you can add a column with df.loc[:, 'new column name'] = ... and assign a list or array of values to the column.
Both accomplish the same task – adding a new column index and populating values for each row – but .loc[] is a little faster.
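Both approaches look like this in practice (a sketch on a hypothetical toy frame, kept separate from fruit_info so it doesn't give away the next question):

```python
import pandas as pd

# toy frame (hypothetical)
df = pd.DataFrame({'a': [1, 2, 3]})

# direct specification
df['b'] = [10, 20, 30]

# using .loc[] -- the ':' selects all rows
df.loc[:, 'c'] = [100, 200, 300]

print(df.columns.tolist())  # ['a', 'b', 'c']
```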
Question 1
Using direct specification, add to the fruit_info
table a new column called rank1
containing integers 1, 2, 3, and 4, which express your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty). Make sure that the numbers utilized are unique - no ties are allowed.
...
# print
fruit_info
grader.check("q1")
Now, create a new dataframe fruit_info_mod1 with the same information as fruit_info, but with the additional column rank2. Let’s start off by making fruit_info_mod1 a copy of fruit_info:
fruit_info_mod1 = fruit_info.copy()
Question 2
Using .loc[], add a column called rank2 to the fruit_info_mod1 table that contains the same values in the same order as the rank1 column.
Hint: .loc will parse : as shorthand for ‘all indices’.
...
# print
fruit_info_mod1
grader.check("q2")
When using the .loc[] approach, the : specifies that values are assigned to all rows of the data frame, so the array assigned to the new variable must be the same length as the data frame. What if we only assign values to certain rows? Try running the cell below.
# define new variable just for rows 1 and 2
fruit_info_mod1.loc[1:2, 'rank3'] = [1, 2]
# check result
fruit_info_mod1
The remaining rows are assigned missing values. Notice what this does to the data type:
# check data types
fruit_info_mod1.dtypes
We can detect these missing values using .isna():
# returns a logical data frame indicating whether each entry is missing or not
fruit_info_mod1.isna()
It would be more helpful to simply see by column whether there are missing values. Appending a .any() to the above command will do the trick:
# detects whether any column has missing entries
fruit_info_mod1.isna().any()
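Chaining .sum() instead of .any() counts the missing entries per column rather than just flagging them. A self-contained sketch on a toy frame (hypothetical data) with one missing entry:

```python
import numpy as np
import pandas as pd

# toy frame with one missing entry (hypothetical)
df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': ['a', 'b', 'c']})

print(df.isna().any())  # x: True, y: False
print(df.isna().sum())  # x: 1,    y: 0
```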
Now that we’ve had a bit of fun, let’s remove those rank variables. Columns can be removed using .drop() with a list of column names to drop as its argument. For example:
# first syntax for .drop()
fruit_info_mod1.drop(columns = 'color')
There is an alternate syntax to that shown above, which involves specifying the axis (row vs. column) and index name to drop:
# second syntax for .drop()
fruit_info_mod1.drop('color', axis = 1)
Question 3
Use the .drop() method to drop all rank columns you created in fruit_info_mod1. Note that drop does not change the table, but instead returns a new table with fewer columns or rows. To store the result, assign a new name (or write over the old dataframe). Here, assign the result to fruit_info_original.
Hint: Look through the documentation to see how you can drop multiple columns of a Pandas dataframe at once using a list of column names.
fruit_info_original = ...
# print
fruit_info_original
grader.check("q3")
Nifty trick: Use df.columns[df.columns.str.startswith('STRING')] to retrieve all column indices starting with STRING, and .values.tolist() to convert an index to a list of index names. Combining these, df.columns[df.columns.str.startswith('STRING')].values.tolist() returns a list of all column names starting with STRING. This can be used in conjunction with the hint to remove all columns starting with rank.
# try the nifty trick here
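As a sketch of how the pieces of the trick fit together (on a hypothetical toy frame, so it doesn't depend on your rank columns):

```python
import pandas as pd

# toy frame (hypothetical) with two rank columns
df = pd.DataFrame({'fruit': ['apple'], 'rank1': [1], 'rank2': [1]})

# build the list of column names starting with 'rank'
to_drop = df.columns[df.columns.str.startswith('rank')].values.tolist()
print(to_drop)  # ['rank1', 'rank2']

# pass the list to .drop() to remove all matching columns at once
print(df.drop(columns = to_drop).columns.tolist())  # ['fruit']
```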
Now create a new dataframe fruit_info_mod2 with the same information as fruit_info_original, but with the column names capitalized. Begin by creating a copy fruit_info_mod2 of fruit_info_original:
fruit_info_mod2 = fruit_info_original.copy()
Question 4
Review the documentation for .rename(). Based on the examples, rename the columns of fruit_info_mod2 so they begin with capital letters.
For many operations, you can change the dataframe ‘in place’ without reassigning the result of the operation to a new name by setting the inplace parameter to True. Use that strategy here.
...
# print
fruit_info_mod2
grader.check("q4")
Operations on Data Frames
With some basics in place, here you’ll see how to perform subsetting operations on data frames that are useful for tidying up datasets.
- Slicing: selecting columns or rows in chunks or by position.
    - Often imported data contain columns that are either superfluous or not of interest for a particular project.
    - You may also want to examine particular portions of a data frame.
- Filtering: selecting rows that meet certain conditions.
    - Often you’ll want to remove duplicate rows, filter missing observations, or select a structured subset of a data frame.
    - Also helpful for inspection.
To illustrate these operations, you’ll use a dataset comprising counts of the given names of babies born in California each year from 1990 - 2018. The cell below imports the baby names data as a data frame from a .csv file. .head() prints the first few rows of the dataset.
# import baby names data
baby_names = pd.read_csv('data/baby_names.csv')
# preview first few rows
baby_names.head()
Your focus here isn’t on analyzing this data, so we won’t ask you to spend too much effort getting acquainted with it. However, a brief inspection is always a good idea. Let’s check:
- dimensions (number of rows and columns);
- how many distinct states, sexes, and years.
Note that the above dataframe displayed is a preview of the full dataframe.
Question 5
You’ve already seen how to examine dimensions using dataframe attributes. Check the dimensions of baby_names and store them in dimensions_baby_names.
dimensions_baby_names = ...
grader.check("q5")
You haven’t yet seen how to retrieve the distinct values of an array or series without duplicates. There are a few different ways to go about this, but one is to count the number of occurrences of each distinct entry in a column. This can be done by retrieving the column as a series using syntax of the form df.colname, and then passing the result to .value_counts():
# count distinct values
baby_names.Sex.value_counts()
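If you only need the distinct values themselves, or just how many there are, .unique() and .nunique() are direct alternatives to counting occurrences. A sketch on a hypothetical toy series:

```python
import pandas as pd

# toy series (hypothetical)
s = pd.Series(['F', 'M', 'F', 'F', 'M'])

print(s.value_counts())  # F: 3, M: 2
print(s.unique())        # ['F' 'M']
print(s.nunique())       # 2
```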
Question 6
Count the number of occurrences of each distinct year. Create a series occur_per_year that displays the number of occurrences, ordered by year (so that the years are displayed in order). If you add sort = False as an argument to value_counts, the distinct values will be displayed in the order they appear in the dataset.
How many years are represented in the dataset? Store your answer as num_years.
occur_per_year = ...
num_years = ...
print(occur_per_year)
print(num_years)
grader.check("q6")
Slicing: selecting rows and columns
There are two fast and simple ways to slice dataframes:
- using .loc to specify rows and columns by index;
- using .iloc to specify rows and columns by position.
You have seen simple examples of both of these above. Here we’ll show how to use these two commands to retrieve multiple rows and columns.
Slicing with .loc: specifying index names
This method retrieves entries by specifying row and column indexes using syntax of the form df.loc[rows, cols]. The rows and columns can be single indices, a list of indices, or a set of adjacent indices using a colon :. Examples of these usages are shown below.
# single indices -- small slice
baby_names.loc[2, 'Name']
# a list of indices -- larger slice
baby_names.loc[[2, 3], ['Name', 'Count']]
# consecutive indices -- a chunk
baby_names.loc[2:10, 'Year':'Count']
Slicing with .iloc: specifying entry positions
An alternative to specifying the indices in order to slice a dataframe is to specify the entry positions using .iloc (‘integer location’). You have seen an example of this too. As with .loc, .iloc can be used to select multiple rows/columns using either lists of positions or a consecutive set with from:to syntax.
# single position
baby_names.iloc[2, 3]
# a list of positions
baby_names.iloc[[2, 3], [3, 4]]
# consecutive positions
baby_names.iloc[2:11, 2:5]
While these commands may look very similar to their .loc analogs, there are some subtle but important differences. The row selection looks nearly identical, but recall that .loc uses the index and .iloc uses the position; they look so similar because typically index and position coincide.
However, sorting the baby_names dataframe helps to reveal how the position of a row is not necessarily equal to the index of a row. For example, the first row is not necessarily the row associated with index 1. This distinction is important in understanding the difference between .loc[] and .iloc[].
# sort and display
sorted_baby_names = baby_names.sort_values(by = ['Name'])
sorted_baby_names.head()
Here is an example of how we would get the 2nd, 3rd, and 4th rows with only the Name column of the baby_names dataframe using both iloc[] and loc[]. Observe the difference, especially after sorting baby_names by name.
# example iloc usage
sorted_baby_names.iloc[1:4, 3]
Notice that using loc[] with 1:4 gives different results, since it selects using the index. The index gets moved around when you perform an operation like sort on the dataframe.
# same syntax, different result
sorted_baby_names.loc[1:4, "Name"]
Above, the .loc method retrieves all indexes between index 1 and index 4 in the order they appear in the sorted dataset. If instead we want to retrieve the same rows returned by the .iloc command, we need to specify the row indices explicitly as a list:
# retrieve the same rows as iloc using loc
sorted_baby_names.loc[[178791, 163914, 171112], 'Name']
Sometimes it’s useful for slicing (and other operations) to set one of the columns to be a row index, effectively treating one column as a collection of row labels. This can be accomplished using set_index.
# change the (row) index from 0,1,2,... to the name column
baby_names_nameindexed = baby_names.set_index("Name")
baby_names_nameindexed.head()
We can now slice by name directly:
# slice rows for ashley and jennifer
baby_names_nameindexed.loc[['Ashley', 'Jennifer'], :]
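As an aside, set_index can be undone with reset_index(), which moves the labels back into an ordinary column and restores the default integer index (a self-contained sketch on a hypothetical two-row frame):

```python
import pandas as pd

# toy frame (hypothetical)
df = pd.DataFrame({'Name': ['Ashley', 'Jennifer'], 'Count': [3, 5]})

indexed = df.set_index('Name')    # 'Name' becomes the row index
restored = indexed.reset_index()  # ...and moves back to an ordinary column

print(restored.equals(df))  # True
```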
Question 7
Look up the name of a friend! Store the name as friend_name. Use the name-indexed data frame to slice rows for the name of your choice and the Count, Sex, and Year columns in that order, and store the data frame as friend_slice.
# if your friend's name is not in the database, use another name
friend_name = ...
friend_slice = ...

# print
friend_slice
grader.check("q7")
Filtering
Filtering is sifting out rows according to a criterion, and can be accomplished using an array or series of Trues and Falses defined by a comparison. To take a simple example, say you wanted to filter out all names with fewer than 1000 occurrences. First you could define a logical series:
# true if filtering criterion is met, false otherwise
arr = baby_names.Count > 1000
Then you can filter using that array:
# filter
baby_names_filtered = baby_names[arr]
baby_names_filtered.head()
Notice that the filtered array is much smaller than the overall array – only about 2000 rows correspond to a name occurring more than 1000 times in a year for a gender.
# compare dimensions
print(baby_names_filtered.shape)
print(baby_names.shape)
You have already encountered this concept in lab 0 when subsetting an array. For your reference, some commonly used comparison operators are given below.
| Symbol | Usage | Meaning |
|---|---|---|
| == | a == b | Does a equal b? |
| <= | a <= b | Is a less than or equal to b? |
| >= | a >= b | Is a greater than or equal to b? |
| < | a < b | Is a less than b? |
| > | a > b | Is a greater than b? |
| ~ | ~p | Returns negation of p |
| \| | p \| q | p OR q |
| & | p & q | p AND q |
| ^ | p ^ q | p XOR q (exclusive or) |
What if instead you wanted to filter using multiple conditions? Here’s an example of retrieving rows with counts exceeding 1000 for only the year 2000:
# filter using two conditions
baby_names[(baby_names.Year == 2000) & (baby_names.Count > 1000)]
Question 8
Select the girl names in 2010 that were given more than 3000 times, and store them as common_girl_names_2010.
Note: Any time you use p & q to filter the dataframe, make sure to use df[(p) & (q)] or df.loc[(p) & (q)]. That is, make sure to wrap conditions with parentheses to ensure the intended order of operations.
common_girl_names_2010 = ...
common_girl_names_2010
grader.check("q8")
Grouping and aggregation
Grouping and aggregation are useful in generating data summaries, which are often important starting points in exploring a dataset.
Aggregation
Aggregation literally means ‘putting together’ (etymologically the word means ‘joining the herd’) – in statistics and data science, this refers to data summaries like an average, a minimum, or a measure of spread such as the sample variance or mean absolute deviation (data herding!). From a technical point of view, operations that take multiple values as inputs and return a single output are considered summaries – in other words, statistics. Some of the most common aggregations are:
- sum
- product
- count
- number of distinct values
- mean
- median
- variance
- standard deviation
- minimum/maximum
- quantiles
Pandas has built-in dataframe operations that compute most of these summaries across either axis (column-wise or row-wise):
- .sum()
- .prod()
- .mean()
- .median()
- .var()
- .std()
- .nunique()
- .min() and .max()
- .quantile()
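Beyond calling these methods one at a time, .agg() can compute several summaries in a single call (an aside not otherwise used in this lab; sketched on a hypothetical toy frame):

```python
import pandas as pd

# toy counts (hypothetical)
df = pd.DataFrame({'Count': [10, 20, 30, 40]})

# several summaries in one call
summary = df.Count.agg(['mean', 'median', 'min', 'max'])
print(summary)
```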
To illustrate these operations, let’s select all names recorded in 1995.
# filter 1995 names
names_95 = baby_names[baby_names.Year == 1995]
How many individuals were counted in total in 1995? We can address that by computing a sum of the counts:
# n for 1995
names_95.Count.sum()
What is the typical frequency of all names in 1995? We can address that by computing the average count:
# average count for a name in 1995
names_95.Count.mean()
Question 9
Find how often the most common name 1995 was given and store this as names_95_max_count
. Use this value to filter names_95
and find which name was most common that year. Store the filtered dataframe as names_95_most_common_name
.
names_95_max_count = ...
names_95_most_common_name = ...
print("Number of people with the most frequent name in 1995 is :", names_95_max_count, "people")
print("Most frequent name in 1995 is:", names_95_most_common_name.values[0])
grader.check("q9")
Caution! If applied to the entire dataframe, the operation df.max() (or any other aggregation) will return the maximum of each column. Notice that the cell below does not return the row you found just now, but could easily be misinterpreted as such. The cell does tell you that the maximum value of sex (alphabetically last) is M, the maximum name (alphabetically last) is Zyanya, and the maximum count is 5003; it does not tell you that 5003 boys were named Zyanya in 1995.
# maximum of each variable
names_95.max()
Grouping
What if you want to know the most frequent male and female names? If so, you’ll need to repeat the above operations group-wise by sex.
In general, any variable in a dataframe can be used to define a grouping structure on the rows (or, less commonly, columns). After grouping, any dataframe operations will be executed within each group, but not across groups. This can be used to generate grouped summaries, such as the maximum count for boys and girls; as a point of terminology, we’d describe this summary as ‘maximum count by sex’.
The .groupby() function defines such a structure; here is the documentation. The cell below groups the names_95 dataframe by sex. Notice that when the grouped dataframe is previewed with .head(), the first few rows are returned for each group.
# grouped dataframe
names_95_bysex = names_95.groupby('Sex')

# print
names_95_bysex.head(2)
Any aggregation operations applied to the grouped dataframe will be applied separately to the rows where Sex == M and the rows where Sex == F. For example, computing .sum() on the grouped dataframe will show the total number of individuals in the data for 1995 by sex:
# number of individuals by sex
names_95_bysex.Count.sum()
The most frequent boy and girl names can be found by applying .idxmax() groupwise to obtain the index of the first occurrence of the maximum count for each sex, and then slicing with .loc:
# first most common names by sex
names_95.loc[names_95_bysex.Count.idxmax(), :]
Since .idxmax() gives the index of the first occurrence, these are the alphabetically first most common names; there could be ties. You know from your work so far that there are no ties for the male names; another filtering step can be used to check for ties among the female names.
# ties?
names_95[names_95_bysex.Count.max().values[0] == names_95['Count']]
So, no ties.
Question 10
Are there more girl names or boy names in 1995? Use the grouped dataframe names_95_bysex with the .count() aggregation to find the total number of names for each sex. Store the female and male counts as girl_name_count and boy_name_count, respectively.
girl_name_count = ...
boy_name_count = ...

# print
print(girl_name_count)
print(boy_name_count)
grader.check("q10")
Chaining operations
You have already seen examples of this, but pandas and numpy operations can be chained together in sequence. For example, names_95.Count.max() is a chain with two steps: first select the Count column (.Count); then compute the maximum (.max()).
Grouped summaries are often convenient to compute in a chained fashion, rather than by assigning the grouped dataframe a new name and performing operations on the resulting object. For example, finding the total number of boys and girls recorded in the 1995 data can be done with the following chain:
# repeating previous calculation, more streamlined
names_95.groupby('Sex').Count.sum()
We can take this even one step further and also perform the filtering in sequence as part of the chain:
# longer chain
baby_names[baby_names.Year == 1995].groupby('Sex').Count.sum()
Chains can get somewhat long, but they have the advantage of making code more concise, and often more readable. We did above in one step what took several lines before. Further, this chain can almost be read aloud:
“Take baby names, filter on year, then group by sex, then select name counts, then compute the sum.”
Let’s now consider computing the average counts of boy and girl names for each year 1990-1995. This can be accomplished by the following chain (notice it is possible to group by multiple variables).
# average counts by sex and year
baby_names[baby_names.Year <= 1995].groupby(['Year', 'Sex']).mean(numeric_only = True)
This display is not ideal. We can ‘pivot’ the table into a wide format by adding a few extra steps in the chain: change the indices to columns; then define a new shape by specifying which column should be the new row index, which should be the new column index, and which values should populate the table.
# average counts by sex and year
baby_names[baby_names.Year <= 1995].groupby(
    ['Year', 'Sex']
).mean(
    numeric_only = True
).reset_index().pivot(
    index = 'Sex', columns = 'Year', values = 'Count'
)
Style comment: break long chains over multiple lines with indentation. The above chain is too long to be readable in one line. To balance the readability of code with the efficiency of chaining, it is good practice to break long chains over several lines, with appropriate indentation.
Here are some rules of thumb on style.
- Separate comparisons by spaces (a<b as a < b)
- Split chains longer than 30-40 characters over multiple lines
- Split lines between delimiters ( and )
- Increase indentation for lines between delimiters
- For chained operations, try to get each step in the chain shown on a separate line
- For functions with multiple arguments, split lines so that each argument is on its own line
Question 11
Write a chain with appropriate style to display the (first) most common boy and girl names in each of the years 2005-2015. Do this in two steps:
- First filter baby_names by year, then group by year and sex, and then find the indices of the first occurrence of the largest counts. Store these indices as ind.
- Then use .loc[] with your stored indices to slice baby_names so as to retrieve the rows corresponding to each most frequent name each year and for each sex; then pivot this table so that the columns are years, the rows are sexes, and the entries are names. Store this as pivot_names.
...

print(ind)
pivot_names
grader.check("q11")
Submission
- Save the notebook.
- Restart the kernel and run all cells. (CAUTION: if your notebook is not saved, you will lose your work.)
- Carefully look through your notebook and verify that all computations execute correctly. You should see no errors; if there are any errors, make sure to correct them before you submit the notebook.
- Download the notebook as an .ipynb file. This is your backup copy.
- Export the notebook as PDF and upload to Gradescope.
To double-check your work, the cell below will rerun all of the autograder tests.
grader.check_all()