(Textbook: Chapter 10)
We are in the Model phase.
In supervised learning: there is a response variable that we try to predict from one or more explanatory variables. Linear regression modeling was an example of this.
In unsupervised learning: there is no designated response variable; we look for structure in the data.
We will discuss unsupervised learning in a later Chapter.
library(tidyverse)   # provides read_csv() and the %>% pipe

url <-
  "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
census <- read_csv(
url,
col_names = c(
"age", "workclass", "fnlwgt", "education",
"education_1", "marital_status", "occupation", "relationship",
"race", "sex", "capital_gain", "capital_loss", "hours_per_week",
"native_country", "income"
)
) %>%
mutate(income = factor(income))
income is the response variable.
Divide the data randomly into a training set and a test set. (We'll see why later.)
Training set has 80% of the original rows. Test set has the remaining 20%.
We will build predictive models on the training set.
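A sketch of one way to make this 80/20 split with rsample (loaded with tidymodels); the name census_parts and the seed are ours, not from the text:

set.seed(123)   # arbitrary seed, for reproducibility
census_parts <- initial_split(census, prop = 0.80)
train <- training(census_parts)
test  <- testing(census_parts)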
Suppose you build your model without using ANY other variables as predictors. Your model would predict that a new individual earns <=50K, the majority class in the training set.
Note: logistic_reg()
is in the parsnip package, which is loaded when tidymodels is attached.
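For instance, the null model (no predictors) can be fit like this; the name mod_log_0 is ours, not from the text:

## "null" logistic regression: intercept only, no predictors
mod_log_0 <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(income ~ 1, data = train)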
Recall that if a logistic model has predictors \(X_1, \ldots, X_r\), then
\[\pi_i = P(y_i = 1) = \frac{\mathrm{e}^{S_i}}{1+\mathrm{e}^{S_i}},\] where \(S_i = \beta_0 +\beta_1x_{1i}+\ldots +\beta_rx_{ri}\).
Here there are no predictors, and we estimated \(\beta_0\) to be
So we estimate the probability of being a high earner as:
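With no predictors, \(S_i\) reduces to \(\beta_0\) for every individual, so the estimated probability is
\[\hat{\pi} = \frac{\mathrm{e}^{\hat{\beta}_0}}{1+\mathrm{e}^{\hat{\beta}_0}}.\]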
Now make the model classify every individual in the training set:
Of course it guesses <=50K every time!
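A sketch of how these classifications and the accuracy shown below can be computed, using the null-model name mod_log_0 from the sketch above:

pred_0 <- train %>%
  select(income) %>%
  bind_cols(
    predict(mod_log_0, new_data = train, type = "class")
  ) %>%
  rename(income_log_0 = .pred_class)

## overall accuracy of the null model on the training set:
pred_0 %>% accuracy(truth = income, estimate = income_log_0)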
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.759
With the null model, the proportion of times we are right is the proportion of subjects who are not high earners!
Note: accuracy()
comes from package yardstick, loaded with tidymodels.
Sad reality: the higher the sensitivity, the higher the false positive rate is liable to be!
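The confusion matrix below can be obtained with yardstick's conf_mat(); a sketch, using the pred_0 data frame from the earlier sketch:

pred_0 %>% conf_mat(truth = income, estimate = income_log_0)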
Truth
Prediction <=50K >50K
<=50K 19763 6285
>50K 0 0
Sensitivity: \(0/(0 + 6285) = 0\).
False positive rate is 0%, but the null model is completely insensitive!
Let’s use capital_gain as the predictor variable.
train_plus <- train %>%
  mutate(high_earner = as.integer(income == ">50K"))

ggplot(train_plus, aes(x = capital_gain, y = high_earner)) +
  ## jittered points, sized by how many observations they represent:
  geom_count(
    position = position_jitter(width = 0, height = 0.05),
    alpha = 0.5
  ) +
  ## fitted logistic curve:
  geom_smooth(
    method = "glm", method.args = list(family = "binomial"),
    color = "dodgerblue", lty = 2, se = FALSE
  ) +
  ## reference line at the 0.5 classification threshold:
  geom_hline(yintercept = 0.5, linetype = 3) +
  scale_x_log10(labels = scales::dollar)
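The one-predictor model referenced below as mod_log_1 can be fit in the same style as the full model later in this section; a minimal sketch:

mod_log_1 <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(income ~ capital_gain, data = train)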
pred_1 <- train %>%
select(income) %>%
bind_cols(
predict(mod_log_1, new_data = train, type = "class")
) %>%
rename(income_log_1 = .pred_class)
pred_1
# A tibble: 26,048 × 2
income income_log_1
<fct> <fct>
1 <=50K <=50K
2 >50K <=50K
3 <=50K <=50K
4 <=50K <=50K
5 <=50K <=50K
6 <=50K <=50K
7 <=50K <=50K
8 <=50K <=50K
9 <=50K <=50K
10 <=50K <=50K
# … with 26,038 more rows
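The confusion matrix and accuracy shown below can be obtained from pred_1 with yardstick; a sketch:

pred_1 %>% conf_mat(truth = income, estimate = income_log_1)
pred_1 %>% accuracy(truth = income, estimate = income_log_1)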
Truth
Prediction <=50K >50K
<=50K 19560 5015
>50K 203 1270
Sensitivity: \(1270/(1270 + 5015) \approx 0.202\).
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.800
The model is a bit more accurate overall (0.800 versus 0.759 for the null model).
So far, when the model classifies an individual:
The threshold is set at 0.5: predict >50K when the estimated probability of >50K exceeds 0.5, and <=50K otherwise.
Other thresholds are possible.
income_probs <-
predict(mod_log_1, new_data = train, type = "prob")
income_probs %>%
group_by(rich = `.pred_>50K` > 0.25) %>%
count() %>%
mutate(pct = n / nrow(income_probs))
# A tibble: 2 × 3
# Groups: rich [2]
rich n pct
<lgl> <int> <dbl>
1 FALSE 23930 0.919
2 TRUE 2118 0.0813
With the lower threshold of 0.25, the model is more likely to classify an individual as a high earner (increasing sensitivity, but also increasing the false positive rate).
income_probs %>%
group_by(rich = `.pred_>50K` > 0.75) %>%
count() %>%
mutate(pct = n / nrow(income_probs))
# A tibble: 2 × 3
# Groups: rich [2]
rich n pct
<lgl> <int> <dbl>
1 FALSE 25123 0.964
2 TRUE 925 0.0355
With the higher threshold of 0.75, the model is less likely to classify an individual as a high earner (decreasing sensitivity, but also decreasing the false positive rate).
The Receiver-Operating Characteristic curve plots sensitivity against false positive rate, for all thresholds from 0 to 1.
roc_1 <-
train %>%
## we just need the true incomes from train:
select(income) %>%
## we need the predicted probability of >50K:
bind_cols(predict(mod_log_1, new_data = train, type = "prob")) %>%
## next, compute sensitivity and specificity at many thresholds:
roc_curve(truth = income, `.pred_>50K`, event_level = "second") %>%
## makes ROC curve (a ggplot2 object) from the above data frame:
autoplot()
## print out the plot:
roc_1
It may help the reader if you refine the plot:
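For example, you might add a title and clearer axis labels; a sketch (the wording is ours):

roc_1 +
  labs(
    title = "ROC curve: income ~ capital_gain, training set",
    x = "1 - specificity (false positive rate)",
    y = "sensitivity"
  )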
mod_log_all <- logistic_reg(mode = "classification") %>%
set_engine("glm") %>%
fit(
income ~ age + workclass + education + marital_status +
occupation + relationship + race + sex +
capital_gain + capital_loss + hours_per_week,
data = train
)
pred_all <- train %>% select(income) %>%
bind_cols(
predict(mod_log_all, new_data = train, type = "class")
) %>%
rename(income_log_all = .pred_class)
We divided the data into a training set and a test set, then built the model on the training set.
What’s the test set for?
Think of studying for an exam by working practice questions: you “fit” yourself to the practice questions, and the real exam measures how well that preparation generalizes.
The null model doesn’t “study” at all. It doesn’t do very well on the training set, or on new data.
Models with more predictors “study” harder: they can match the training set very closely, perhaps too closely.
The more predictors a model uses, the more we have to worry about overfitting.
You can try a model on the test set to estimate how well it will work in practice on new data.
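A sketch of how the test-set results below can be produced, using the full model and the test set from the earlier split:

pred_test <- test %>%
  select(income) %>%
  bind_cols(
    predict(mod_log_all, new_data = test, type = "class")
  ) %>%
  rename(income_log_all = .pred_class)

## confusion matrix and accuracy on the test set:
pred_test %>% conf_mat(truth = income, estimate = income_log_all)
pred_test %>% accuracy(truth = income, estimate = income_log_all)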
Truth
Prediction <=50K >50K
<=50K 4913 1239
>50K 44 317
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.803
This time, overfitting seems not to be a problem.
Another approach to estimating performance on new data is cross-validation: divide the data into \(k\) “folds”, and for each fold, fit the model on the other \(k-1\) folds and measure its performance on the held-out fold.
Then you have \(k\) measurements of possible performance on new data.
We’ll see cross-validation used in later Chapters.
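As a preview, a minimal sketch of \(k\)-fold cross-validation (here \(k = 5\)) with rsample and tune, both loaded with tidymodels; the object names are ours:

set.seed(123)
folds <- vfold_cv(train, v = 5)

cv_results <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit_resamples(
    income ~ capital_gain,
    resamples = folds,
    metrics = metric_set(accuracy, roc_auc)
  )

## summarize the k per-fold estimates:
collect_metrics(cv_results)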