# Initialize Otter
import otter
= otter.Notebook("lab7-classification.ipynb") grader
Lab 7: Classification
import numpy as np
import pandas as pd
import altair as alt
import statsmodels.api as sm
# disable row limit for plotting
alt.data_transformers.disable_max_rows()# uncomment to ensure graphics display with pdf export
# alt.renderers.enable('mimetype')
Lab 7: Classification
This lab covers binary regression and classification using logistic regression models. The logistic regression model for a binary outcome \(y \in \{0, 1\}\) posits that the probability of the outcome of interest follows a logistic function of the explanatory variable \(x\):
\[ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]
More commonly, the model is written in terms of the log-odds of the outcome of interest:
\[ \log\left[\frac{P(Y = 1)}{P(Y = 0)}\right] = \beta_0 + \beta_1 x \]
Additional explanatory variables can be included in the model by specifying a linear predictor with additional \(\beta_j x_j\) terms.
Logistic regression models represent the probability of an outcome as a function of one or more explanatory variables; fitted probabilities can be coerced to hard classifications by thresholding.
For this lab, we’ll revisit the SEDA data from an earlier assignment. Below are the log median incomes and estimated achievement gaps on math and reading tests for 625 school districts in California:
= pd.read_csv('data/seda.csv').dropna()
seda seda.head()
The estimated achievement gap is positive if boys outperform girls, and negative if girls outperform boys. We can therefore define a binary indicator of the direction of the achievement gap:
'favors_boys'] = (seda.gap > 0)
seda[ seda.head()
You may recall having calculated the proportion of districts in various income brackets with a math gap favoring boys.
We will now consider the closely related problem of estimating the probability that a district has a math gap favoring boys based on the median income of the district.
Since we’re only considering math gaps, we’ll filter out the gap estimates on reading tests.
= seda[seda.subject == 'math'].loc[:, ['log_income', 'favors_boys']] reg_data
Let’s set aside the data for 100 randomly chosen districts to use later in quantifying the classification accuracy of the model.
Question 1: data partitioning
Set aside 100 observations at random for testing. Do this by selecting a random subset of 100 indices. Choose a different RNG seed from your neighbor so that you can compare results based on different training sets.
# select 100 indices at random
= ...
seed
np.random.seed(seed)= np.random.choice(
idx
...= ...
size = ...
replace
).tolist()
# partition data
= reg_data.loc[idx]
test = reg_data.drop(index = idx) train
"q1") grader.check(
Exploratory analysis
Previously you had binned income into brackets and constructed a table of the proportion of districts in each income bracket with a math gap favoring boys. It turns out that binning and aggregation is a useful exploratory strategy for binary regression. Your table from before would have been something like this:
# define income bracket
'income_bracket'] = pd.cut(train.log_income, 10)
train[
# compute mean and standard deviation of each variable by bracket
= train.groupby('income_bracket').agg(func = ['mean', 'std'])
tbl
# fix column indexing and remove 'artificial' brackets containing only min and max values
= ["_".join(a) for a in tbl.columns.to_flat_index()]
tbl.columns = tbl[tbl.favors_boys_std > 0]
tbl
# display
tbl
We can plot these proportions, with standard deviations, as functions of income. Since standard deviations are fairly high, the variability bands only show 0.4 standard deviations in either direction.
= alt.Chart(tbl).mark_line(point = True).encode(
trend = alt.X('log_income_mean', title = 'log income'),
x = alt.Y('favors_boys_mean', title = 'Pr(math gap favors boys)')
y
)
= alt.Chart(tbl).transform_calculate(
band = 'datum.favors_boys_mean - 0.4*datum.favors_boys_std',
lwr = 'datum.favors_boys_mean + 0.4*datum.favors_boys_std'
upr = 0.3).encode(
).mark_area(opacity = 'log_income_mean',
x = alt.Y('lwr:Q', scale = alt.Scale(domain = [0, 1])),
y = 'upr:Q'
y2
)
+ band trend
We can regard these proportions as estimates of the probability that the achievement gap in math favors boys. Thus, the figure above displays the exact relationship we will attempt to model, only as a continuous function of income rather than at 8 discrete points.
Question 2: model assumptions
The logistic regression model assumes that the probability of the outcome of interest is a monotonic function of the explanatory variable(s). Examine the plot above and discuss with your neighbor. Does this monotinicity assumption seem to be true? Why or why not?
Type your answer here, replacing this text.
Model fitting
We’ll fit a simple model of the probility that the math gap favors boys as a logistic function of log income:
\[ \log\left[\frac{P(\text{gap favors boys})}{1 - P(\text{gap favors boys})}\right] = \beta_0 + \beta_1 \log(\text{median income}) \]
The data preparations are exactly the same as in linear regression: we’ll obtain a vector of the response outcome and an explanatory variable matrix containing log median income and a constant (for the intercept).
# explanatory variable matrix
= sm.add_constant(train.log_income)
x
# response
= train.favors_boys y
The model is fit using statsmodels.Logit()
. Note that the endogenous variable (the response) can be either Boolean (take values True
and False
) or integer (take values 0
or 1
).
# fit the model
= sm.Logit(endog = y, exog = x)
model
# store the fitted model
= model.fit()
fit
# display parameter estimates
fit.params
A coefficient table remains useful for logistic regression:
= pd.DataFrame(
coef_tbl 'estimate': fit.params,
{'standard error': np.sqrt(fit.cov_params().values.diagonal())},
= x.columns
index
) coef_tbl
Question 3: confidence intervals
Compute 99% confidence intervals for the model parameters. Store the result as a dataframe called param_ci
.
Hint: the syntax is identical to that based on sm.OLS
; this is also mentioned in the lecture slides.
# compute 99% confidence intervals
= ...
param_ci
# display
= {0: 'lwr', 1: 'upr'}, inplace = True)
param_ci.rename(columns param_ci
"q3") grader.check(
We can superimpose the predicted probabilities for a fine grid of log median incomes on the data figure we had made previously to compare the fitted model with the observed values:
# grid of log income values
= pd.DataFrame({
grid_df 'log_income': np.linspace(9, 14, 200)
})
# add predictions
'pred'] = fit.predict(sm.add_constant(grid_df))
grid_df[
# plot predictions against income
= alt.Chart(grid_df).mark_line(color = 'red', opacity = 0.5).encode(
model_viz = 'log_income',
x = 'pred'
y
)
# superimpose on data figure
+ band + model_viz trend
Depending on your training sample, the model may or may not align well with the computed proportions, but it should be mostly or entirely within the 0.4-standard-deviation band.
To interpret the estimated relationship, recall that if median income is doubled, the log-odds changes by:
\[ \hat{\beta}_1\log(2\times\text{median income}) - \hat{\beta}_1 \log(\text{median income}) = \hat{\beta}_1 \log(2) \]
Now, exponentiating gives the estimated multiplicative change in odds:
\[ \exp\left\{\log(\text{baseline odds}) + \hat{\beta}_1 \log(2)\right\} = \text{baseline odds} \times e^{\hat{\beta}_1 \log(2)} \]
So computing \(e^{\hat{\beta}_1 \log(2)}\) gives a quantity we can readily interpret:
*np.log(2)) np.exp(fit.params.log_income
The exact number will depend a little bit on the data partition you used to compute the estimate, but the answer should be roughly consistent with the following interpretation:
Each doubling of median income is associated with an estimated four-fold increase in the odds that a school district has a math gap favoring boys.
Classification
Now we’ll consider the task of classifying new school districts by the predicted direction of their math achievement gap.
A straightforward classification rule would be:
\[ \text{gap predicted to favor boys} \quad\Longleftrightarrow\quad \widehat{Pr}(\text{gap favors boys}) > 0.5 \]
We can obtain the estimated probabilities using .predict()
, and construct the classifier manually. To assess the accuracy, we’ll want to arrange the classifications side-by-side with the observed outcomes:
# compute predicted probabilities on test set
= fit.predict(sm.add_constant(test.log_income))
preds
# construct classifier
= pd.DataFrame({
pred_df 'observation': test.favors_boys,
'prediction': preds > 0.5
})
# preview
pred_df.head()
Note that the testing partition was used here – to get an unbiased estimate of the classification accuracy, we need data that were not used in fitting the model.
Cross-tabulating observed and predicted outcomes gives a detailed view of the accuracy and error:
# cross-tabulate classifications with observed outcomes
= pred_df.groupby(['observation', 'prediction']).size()
pred_tbl pred_tbl
The entries where observation
and prediction
have the same value are counts of the number of districts correctly classified; those where they do not match are counts of errors.
Question 4: overall classification accuracy
Compute the overall classification accuracy – the proportion of districts that were correctly classified.
= ... accuracy
"q4") grader.check(
Often class-wise accuracy rates are more informative, because there are two possible types of error:
- A district that has a math gap favoring girls is classified as having a math gap favoring boys
- A district that has a math gap favoring boys is classified as having a math gap favoring girls
You may notice that there were more errors of one type than another in your result above. This is not conveyed by reporting the overall accuracy rate.
For a clearer picture, we can find the proportion of errors among by outcome:
'error'] = (pred_df.observation != pred_df.prediction)
pred_df[= pred_df[pred_df.observation == True].error.mean()
fnr = pred_df[pred_df.observation == False].error.mean()
fpr = 1 - fpr
tpr = 1 - fnr
tnr
print('false positive rate: ', fpr)
print('false negative rate: ', fnr)
print('true positive rate (sensitivity): ', tpr)
print('true negative rate (specificity): ', tnr)
Question 5: make your own classifier
Define a new classifier by adjusting the probability threshold. Compute and print the false positive, false negative, true positive, and true negative rates. Experiment until you achieve a better balance between errors of each type.
# construct classifier
= pd.DataFrame({
new_pred_df
...
...
})
# compute error rates
'error'] = ...
new_pred_df[= ...
new_fnr = ...
new_fpr = ...
new_tpr = ...
new_tnr
# print
print('false positive rate: ', new_fpr)
print('false negative rate: ', new_fnr)
print('true positive rate (sensitivity): ', new_tpr)
print('true negative rate (specificity): ', new_tnr)
"q5") grader.check(
Submission
- Save the notebook.
- Restart the kernel and run all cells. (CAUTION: if your notebook is not saved, you will lose your work.)
- Carefully look through your notebook and verify that all computations execute correctly and all graphics are displayed clearly. You should see no errors; if there are any errors, make sure to correct them before you submit the notebook.
- Download the notebook as an
.ipynb
file. This is your backup copy. - Export the notebook as PDF and upload to Gradescope.
To double-check your work, the cell below will rerun all of the autograder tests.
grader.check_all()