Assignment 6
Unless otherwise stated, assignments are to be done individually. You are welcome to work with others to master the principles and approaches used to solve the homework problems, but the work you turn in should be your own. Under no circumstance should you seek out or look at solutions to assignments given in previous years.
If you need help with any problem, be sure to follow the Ed posting norms.
Overview
Continuing our example from Assignment 5, we will use survey data to investigate bias-variance trade-offs, over-fitting, under-fitting, cross-validation, and regularization.
Problem 1
Download and explore a small dataset that includes a subset of 400 responses, similar to the survey data introduced in Assignment 5, but now with a few more features.
Throughout this problem set we will be predicting the likelihood that an individual will vote for candidate A, so for convenience you may wish to create a binary (TRUE / FALSE) variable, candidate_A, that records whether or not each individual indicated support for candidate A. Also, you may find it useful to compute AUC via the ROCR package, where necessary. Be sure to reference the discussion notes for how to define your own compute_auc() function.
Problem 1, Part A
Using all the available features, fit a logistic regression model on the small dataset of 400 rows to predict the probability that an individual will vote for candidate A. Evaluate the performance of the model by computing AUC on the same data that you fit the model on.
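As a rough sketch, assuming the responses are loaded into a data frame named poll_data (a placeholder name) with the candidate_A variable described above:
library(ROCR)

# Fit a logistic regression using all other columns as features
model <- glm(candidate_A ~ ., data = poll_data, family = "binomial")

# In-sample predicted probabilities
p <- predict(model, type = "response")

# AUC on the same data the model was fit on, via ROCR
pred <- prediction(p, poll_data$candidate_A)
performance(pred, "auc")@y.values[[1]]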
Problem 1, Part B
Download a test set of 400 new responses. Using the model you fit in 1a, predict votes for these new individuals.
- Note that you should not re-train your model on this test set; rather, use the model you fit on the initial set of data in 1a to simply generate predictions on this new test set.
Evaluate the performance of the model on the test set by computing the AUC. Compare the performance of the model on the two datasets, and briefly comment on any differences you observe.
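A sketch of this step, reusing the model object from 1a and assuming the new responses are in a data frame named test_data (a placeholder name):
# Generate predictions on the test set without refitting the model
p_test <- predict(model, newdata = test_data, type = "response")

# Test-set AUC via ROCR
pred_test <- prediction(p_test, test_data$candidate_A)
performance(pred_test, "auc")@y.values[[1]]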
Problem 1, Part C
Re-train the model using a larger dataset with 9,500 observations (which includes the 400 responses from 1a, but not the 400 observations from the test set in 1b).
For the newly fitted model, evaluate its performance on both this larger dataset and the test set from 1b. Briefly comment on what you find.
Problem 2
Now we explore the idea of model selection using cross-validation and \(L^p\) regularization, to improve our model from Problem 1.
For this problem, use the same small dataset of 400 observations, and the test set that we’ve been using in Problem 1. Do not use the larger dataset of 9,500 rows.
Problem 2, Part A
Start with the 400 observations from part 1a, and divide the dataset into two components: a training set consisting of the first 300 observations, and a validation set with the remaining 100 observations. Do not shuffle the dataset before splitting.
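For instance, assuming the 400 observations are loaded into a data frame named small_data (a placeholder name):
# First 300 rows for training, remaining 100 for validation (no shuffling)
train_data <- small_data[1:300, ]
validation_data <- small_data[301:400, ]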
Fit L1- and L2-regularized models on the training set with 10 values of the regularization parameter lambda.
To implement L1 and L2 regularization in R, we use the glmnet package. The glmnet package makes fitting regularized models simple, but it requires a model matrix instead of a formula. Reading the documentation for glmnet is highly recommended. The number of lambda values to fit can be specified with the nlambda parameter. For example, to train an L1-regularized model on a data frame named train_data, using 10 values of lambda with a response variable in the candidate_A column and just age as a covariate, you can use this code:
library(glmnet)

# Build a model matrix (dropping the intercept column) and a response vector
x <- model.matrix(candidate_A ~ age, train_data)[, -1]
y <- train_data$candidate_A
model <- glmnet(x, y, alpha = 1, nlambda = 10, family = "binomial")
We use alpha = 1 for L1-regularization, and alpha = 0 for L2-regularization. You'll need to adapt the above code to train a model that uses all the available features.
Once a model is fit, the actual values of lambda used can be extracted from the model with model$lambda.
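Putting these hints together, a minimal sketch of fitting both regularized models on all available features (with glmnet loaded, and train_data as defined above; model_l1 and model_l2 are placeholder names):
# All columns other than candidate_A as covariates; [, -1] drops the intercept
x <- model.matrix(candidate_A ~ ., train_data)[, -1]
y <- train_data$candidate_A

model_l1 <- glmnet(x, y, alpha = 1, nlambda = 10, family = "binomial")  # L1
model_l2 <- glmnet(x, y, alpha = 0, nlambda = 10, family = "binomial")  # L2

# The values of lambda actually used for each model
model_l1$lambda
model_l2$lambda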
Problem 2, Part B
For each of the 20 resulting models (10 values of lambda for each of the L1- and L2-regularized models), compute AUC on the training set (300 rows) and validation set (100 rows).
Using predict() with glmnet will generate a matrix of predictions for all values of lambda that were used to fit your model. For example, to generate predictions from the fitted model on a data frame of 100 rows named d, you can adapt this code:
new_x <- model.matrix(candidate_A ~ age, d)[, -1]
# One column of predictions for each value of model$lambda
p <- predict(model, newx = new_x, type = "response")
- This will result in a matrix of 100 rows (corresponding to the rows of d) and 10 columns (corresponding to the values of model$lambda).
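One way to turn this matrix into one AUC per value of lambda is to apply your helper to each column; a sketch, assuming compute_auc(predicted, actual) matches the signature from the discussion notes:
# AUC for each column of p, i.e., for each value of model$lambda
val_auc <- apply(p, 2, function(col) compute_auc(col, d$candidate_A))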
Plot two lines that show how AUC on the training and validation sets changes as a function of the regularization parameter lambda, for each of the L1- and L2-regularized models.
In other words, plot the value of lambda on the x-axis, and the value of AUC achieved on the training and validation sets on the y-axis. For this plot, use a log10 scale for the x-axis (see scale_x_log10 in ggplot).
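For example, a plotting sketch, assuming you have gathered your results into a data frame named plot_data with columns lambda, auc, split ("train" or "validation"), and penalty ("L1" or "L2"), all hypothetical names:
library(ggplot2)

# One line per dataset split, one panel per penalty type,
# with lambda on a log10-scaled x-axis
ggplot(plot_data, aes(x = lambda, y = auc, color = split)) +
  geom_line() +
  facet_wrap(~ penalty) +
  scale_x_log10()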
Problem 2, Part C
i. For each of the L1- and L2-regularized models, select the value of lambda that achieved the highest AUC on the validation set.
- As a result, you should have two models, one L1 and one L2-regularized model.
ii. Report the value of AUC on the validation set achieved for each of the two selected models.
iii. Plot coefficient values for the selected L1 and L2-regularized models. Briefly comment on what you observe.
You can extract the model coefficients with the coef command. To specifically extract the coefficients for a model that uses lambda = l, set the s parameter to l, e.g., coef(model, s = l).
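As a sketch of the coefficient plot, assuming best_lambda (a placeholder name) holds the value of lambda you selected for, say, the L1-regularized model:
# Coefficients at the selected lambda (returned as a sparse matrix)
coefs <- coef(model_l1, s = best_lambda)

# Convert to a plain data frame for plotting with ggplot
coef_df <- data.frame(term = rownames(coefs),
                      estimate = as.numeric(as.matrix(coefs)))

ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_point()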
iv. Finally, of the two models, select the single model that achieved the highest validation AUC. Compute the AUC for this model on the test set.
- To generate predictions from the fitted model on a data frame named d, using a specific value of lambda, you can write:
new_x <- model.matrix(candidate_A ~ age, d)[, -1]
# s = l restricts the predictions to the single value lambda = l
p <- predict(model, newx = new_x, type = "response", s = l)
- In the code above, l is the value of lambda you would like to use for generating the predictions.
Problem 3
Please provide a brief (1-2 paragraph) status update on your project, including your specific contributions thus far and your plans for the upcoming week.
Submission
Prepare a short report detailing your results.
Please submit the following:
- your report as a single PDF file
- a single, fully functional R script or markdown file that we can run to reproduce all the numerical results and plots in your report.
We will put the necessary data files into the same directory as your script before running it.
Be sure to read these report tips before preparing your submission. Reports that are difficult to parse may lose credit.
Please submit your work on Canvas.