Lab 7: Classification

# Initialize Otter
import otter
grader = otter.Notebook("lab7-classification.ipynb")
import numpy as np
import pandas as pd
import altair as alt
import statsmodels.api as sm

# disable row limit for plotting
alt.data_transformers.disable_max_rows()
# uncomment to ensure graphics display with pdf export
# alt.renderers.enable('mimetype')

This lab covers binary regression and classification using logistic regression models. The logistic regression model for a binary outcome \(y \in \{0, 1\}\) posits that the probability of the outcome of interest follows a logistic function of the explanatory variable \(x\):

\[ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]

More commonly, the model is written in terms of the log-odds of the outcome of interest:

\[ \log\left[\frac{P(Y = 1)}{P(Y = 0)}\right] = \beta_0 + \beta_1 x \]

Additional explanatory variables can be included in the model by specifying a linear predictor with additional \(\beta_j x_j\) terms.
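For example, with two explanatory variables the model for the log-odds becomes:

\[ \log\left[\frac{P(Y = 1)}{P(Y = 0)}\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]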

Logistic regression models represent the probability of an outcome as a function of one or more explanatory variables; fitted probabilities can be coerced to hard classifications by thresholding.
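As a quick illustration of both points, here is a minimal sketch (with made-up coefficients rather than estimates from any data) of computing logistic probabilities and thresholding them:

# minimal sketch: logistic probabilities and a threshold classifier
# beta0, beta1, and x are made-up values for illustration only
beta0, beta1 = -6.0, 0.5
x = np.array([4, 8, 12, 16, 20])

# probability of the outcome is a logistic function of the linear predictor
p = 1/(1 + np.exp(-(beta0 + beta1*x)))

# thresholding at 0.5 coerces probabilities to hard classifications
p.round(3), p > 0.5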

For this lab, we’ll revisit the SEDA data from an earlier assignment. Below are the log median incomes and estimated achievement gaps on math and reading tests for 625 school districts in California:

seda = pd.read_csv('data/seda.csv').dropna()
seda.head()

The estimated achievement gap is positive if boys outperform girls, and negative if girls outperform boys. We can therefore define a binary indicator of the direction of the achievement gap:

seda['favors_boys'] = (seda.gap > 0)
seda.head()

You may recall having calculated the proportion of districts in various income brackets with a math gap favoring boys.

We will now consider the closely related problem of estimating the probability that a district has a math gap favoring boys based on the median income of the district.

Since we’re only considering math gaps, we’ll filter out the gap estimates on reading tests.

reg_data = seda[seda.subject == 'math'].loc[:, ['log_income', 'favors_boys']]

Let’s set aside the data for 100 randomly chosen districts to use later in quantifying the classification accuracy of the model.

Question 1: data partitioning

Set aside 100 observations at random for testing. Do this by selecting a random subset of 100 indices. Choose an RNG seed different from your neighbor’s so that you can compare results based on different training sets.

# select 100 indices at random
seed = ...
np.random.seed(seed)
idx = np.random.choice(
    ...,
    size = ...,
    replace = ...
).tolist()

# partition data
test = reg_data.loc[idx]
train = reg_data.drop(index = idx)
grader.check("q1")

Exploratory analysis

Previously you had binned income into brackets and constructed a table of the proportion of districts in each income bracket with a math gap favoring boys. It turns out that binning and aggregation is a useful exploratory strategy for binary regression. Your table from before would have been something like this:

# define income bracket
train['income_bracket'] = pd.cut(train.log_income, 10)

# compute mean and standard deviation of each variable by bracket
tbl = train.groupby('income_bracket').agg(func = ['mean', 'std'])

# fix column indexing and remove 'artificial' brackets containing only min and max values
tbl.columns = ["_".join(a) for a in tbl.columns.to_flat_index()]
tbl = tbl[tbl.favors_boys_std > 0]

# display
tbl

We can plot these proportions, with standard deviations, as functions of income. Since standard deviations are fairly high, the variability bands only show 0.4 standard deviations in either direction.


trend = alt.Chart(tbl).mark_line(point = True).encode(
    x = alt.X('log_income_mean', title = 'log income'),
    y = alt.Y('favors_boys_mean', title = 'Pr(math gap favors boys)')
)

band = alt.Chart(tbl).transform_calculate(
    lwr = 'datum.favors_boys_mean - 0.4*datum.favors_boys_std',
    upr = 'datum.favors_boys_mean + 0.4*datum.favors_boys_std'
).mark_area(opacity = 0.3).encode(
    x = 'log_income_mean',
    y = alt.Y('lwr:Q', scale = alt.Scale(domain = [0, 1])),
    y2 = 'upr:Q'
)

trend + band

We can regard these proportions as estimates of the probability that the achievement gap in math favors boys. Thus, the figure above displays the exact relationship we will attempt to model, only as a continuous function of income rather than at 8 discrete points.

Question 2: model assumptions

The logistic regression model assumes that the probability of the outcome of interest is a monotonic function of the explanatory variable(s). Examine the plot above and discuss with your neighbor. Does this monotonicity assumption seem plausible here? Why or why not?

Type your answer here, replacing this text.

Model fitting

We’ll fit a simple model of the probability that the math gap favors boys as a logistic function of log income:

\[ \log\left[\frac{P(\text{gap favors boys})}{1 - P(\text{gap favors boys})}\right] = \beta_0 + \beta_1 \log(\text{median income}) \]

The data preparations are exactly the same as in linear regression: we’ll obtain a vector of the response outcome and an explanatory variable matrix containing log median income and a constant (for the intercept).

# explanatory variable matrix
x = sm.add_constant(train.log_income)

# response
y = train.favors_boys

The model is fit using sm.Logit(). Note that the endogenous variable (the response) can be either Boolean (taking the values True and False) or integer (taking the values 0 and 1).

# fit the model
model = sm.Logit(endog = y, exog = x)

# store the fitted model
fit = model.fit()

# display parameter estimates
fit.params

A coefficient table remains useful for logistic regression:

coef_tbl = pd.DataFrame(
    {'estimate': fit.params,
    'standard error': np.sqrt(fit.cov_params().values.diagonal())},
    index = x.columns
)
coef_tbl

Question 3: confidence intervals

Compute 99% confidence intervals for the model parameters. Store the result as a dataframe called param_ci.

Hint: the syntax is identical to that based on sm.OLS; this is also mentioned in the lecture slides.

# compute 99% confidence intervals
param_ci = ...

# display
param_ci.rename(columns = {0: 'lwr', 1: 'upr'}, inplace = True)
param_ci
grader.check("q3")

We can superimpose the predicted probabilities for a fine grid of log median incomes on the data figure we had made previously to compare the fitted model with the observed values:

# grid of log income values
grid_df = pd.DataFrame({
    'log_income': np.linspace(9, 14, 200)
})

# add predictions
grid_df['pred'] = fit.predict(sm.add_constant(grid_df))

# plot predictions against income
model_viz = alt.Chart(grid_df).mark_line(color = 'red', opacity = 0.5).encode(
    x = 'log_income',
    y = 'pred'
)

# superimpose on data figure
trend + band + model_viz

Depending on your training sample, the model may or may not align well with the computed proportions, but it should be mostly or entirely within the 0.4-standard-deviation band.

To interpret the estimated relationship, recall that if median income is doubled, the log-odds changes by:

\[ \hat{\beta}_1\log(2\times\text{median income}) - \hat{\beta}_1 \log(\text{median income}) = \hat{\beta}_1 \log(2) \]

Now, exponentiating gives the estimated multiplicative change in odds:

\[ \exp\left\{\log(\text{baseline odds}) + \hat{\beta}_1 \log(2)\right\} = \text{baseline odds} \times e^{\hat{\beta}_1 \log(2)} \]

So computing \(e^{\hat{\beta}_1 \log(2)}\) gives a quantity we can readily interpret:

np.exp(fit.params.log_income*np.log(2))

The exact number will depend a little bit on the data partition you used to compute the estimate, but the answer should be roughly consistent with the following interpretation:

Each doubling of median income is associated with an estimated four-fold increase in the odds that a school district has a math gap favoring boys.
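The arithmetic behind that reading: since \(e^{\hat{\beta}_1 \log(2)} = 2^{\hat{\beta}_1}\), a slope estimate near 2 doubles the odds twice, i.e., quadruples them. A quick check that the two expressions agree:

# e^(b1*log 2) and 2^b1 are the same quantity
b1 = fit.params.log_income
np.exp(b1*np.log(2)), 2**b1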

Classification

Now we’ll consider the task of classifying new school districts by the predicted direction of their math achievement gap.

A straightforward classification rule would be:

\[ \text{gap predicted to favor boys} \quad\Longleftrightarrow\quad \widehat{Pr}(\text{gap favors boys}) > 0.5 \]

We can obtain the estimated probabilities using .predict(), and construct the classifier manually. To assess the accuracy, we’ll want to arrange the classifications side-by-side with the observed outcomes:

# compute predicted probabilities on test set
preds = fit.predict(sm.add_constant(test.log_income))

# construct classifier
pred_df = pd.DataFrame({
    'observation': test.favors_boys,
    'prediction': preds > 0.5
})

# preview
pred_df.head()

Note that the testing partition was used here – to get an unbiased estimate of the classification accuracy, we need data that were not used in fitting the model.

Cross-tabulating observed and predicted outcomes gives a detailed view of the accuracy and error:

# cross-tabulate classifications with observed outcomes
pred_tbl = pred_df.groupby(['observation', 'prediction']).size()
pred_tbl

The entries where observation and prediction have the same value are counts of the number of districts correctly classified; those where they do not match are counts of errors.
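The same counts can also be arranged as a two-way table with pandas’ crosstab function; a sketch:

# same counts as a 2x2 table: rows are observed values, columns predictions
pd.crosstab(pred_df.observation, pred_df.prediction)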

Question 4: overall classification accuracy

Compute the overall classification accuracy – the proportion of districts that were correctly classified.

accuracy = ...
grader.check("q4")

Often class-wise accuracy rates are more informative, because there are two possible types of error:

  1. A district that has a math gap favoring girls is classified as having a math gap favoring boys
  2. A district that has a math gap favoring boys is classified as having a math gap favoring girls

You may notice that there were more errors of one type than another in your result above. This is not conveyed by reporting the overall accuracy rate.

For a clearer picture, we can find the proportion of errors by observed outcome:

# indicator of misclassification
pred_df['error'] = (pred_df.observation != pred_df.prediction)

# error rate among districts whose gap actually favors boys (positives)
fnr = pred_df[pred_df.observation == True].error.mean()

# error rate among districts whose gap actually favors girls (negatives)
fpr = pred_df[pred_df.observation == False].error.mean()

tpr = 1 - fnr  # sensitivity
tnr = 1 - fpr  # specificity

print('false positive rate: ', fpr)
print('false negative rate: ', fnr)
print('true positive rate (sensitivity): ', tpr)
print('true negative rate (specificity): ', tnr)
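Each of these rates is one cell of the cross-tabulation divided by its row total. A quick consistency check (a sketch; it assumes both predicted classes appear among the test predictions):

# recover the class-wise error rates from the cross-tabulated counts
counts = pd.crosstab(pred_df.observation, pred_df.prediction)
print('fpr from counts: ', counts.loc[False, True] / counts.loc[False].sum())
print('fnr from counts: ', counts.loc[True, False] / counts.loc[True].sum())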

Question 5: make your own classifier

Define a new classifier by adjusting the probability threshold. Compute and print the false positive, false negative, true positive, and true negative rates. Experiment until you achieve a better balance between errors of each type.

# construct classifier
new_pred_df = pd.DataFrame({
    ...
    ...
})

# compute error rates
new_pred_df['error'] = ...
new_fnr = ...
new_fpr = ...
new_tpr = ...
new_tnr = ...

# print
print('false positive rate: ', new_fpr)
print('false negative rate: ', new_fnr)
print('true positive rate (sensitivity): ', new_tpr)
print('true negative rate (specificity): ', new_tnr)
grader.check("q5")

Submission

  1. Save the notebook.
  2. Restart the kernel and run all cells. (CAUTION: if your notebook is not saved, you will lose your work.)
  3. Carefully look through your notebook and verify that all computations execute correctly and all graphics display clearly; correct any errors before you submit.
  4. Download the notebook as an .ipynb file. This is your backup copy.
  5. Export the notebook as PDF and upload to Gradescope.

To double-check your work, the cell below will rerun all of the autograder tests.

grader.check_all()