(Textbook: Chapter 10)
We are in the Model phase.
In supervised learning: there is a response variable that we try to predict from one or more explanatory variables. Linear regression modeling was an example of this.
In unsupervised learning: there is no designated response variable; we look for structure in the data.
We will discuss unsupervised learning in a later Chapter.
library(tidyverse)   # provides read_csv() and the %>% pipe

url <-
  "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
census <- read_csv(
url,
col_names = c(
"age", "workclass", "fnlwgt", "education",
"education_1", "marital_status", "occupation", "relationship",
"race", "sex", "capital_gain", "capital_loss", "hours_per_week",
"native_country", "income"
)
) %>%
mutate(income = factor(income))
income is the response variable.
Divide the data randomly into a training set and a test set. (We'll see why later.)
Training set has 80% of the original rows. Test set has the remaining 20%.
We will build predictive models on the training set.
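A sketch of one way to make this 80/20 split with rsample (loaded with tidymodels); the name census_parts and the seed are ours, not from the text:

set.seed(123)   # arbitrary seed, for reproducibility
census_parts <- initial_split(census, prop = 0.80)
train <- training(census_parts)
test  <- testing(census_parts)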
Suppose you build your model without using ANY other variables as predictors. Your model would predict that a new individual earns <=50K, the majority class in the training set.
Note: logistic_reg()
is in the parsnip package, which is loaded when tidymodels is attached.
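For instance, the null model (no predictors) can be fit like this; the name mod_log_0 is ours, not from the text:

## "null" logistic regression: intercept only, no predictors
mod_log_0 <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(income ~ 1, data = train)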
Recall that if a logistic model has predictors \(X_1, \ldots, X_r\), then
\[\pi_i = P(y_i = 1) = \frac{\mathrm{e}^{S_i}}{1+\mathrm{e}^{S_i}},\] where \(S_i = \beta_0 +\beta_1x_{1i}+\ldots +\beta_rx_{ri}\).
Here there are no predictors, and we estimated \(\beta_0\) to be
So we estimate the probability of being a high earner as:
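With no predictors, \(S_i\) reduces to \(\beta_0\) for every individual, so the estimated probability is
\[\hat{\pi} = \frac{\mathrm{e}^{\hat{\beta}_0}}{1+\mathrm{e}^{\hat{\beta}_0}}.\]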
Now make the model classify every individual in the training set:
Of course it guesses <=50K every time!
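A sketch of how these classifications and the accuracy shown below can be computed, using the null-model name mod_log_0 from the sketch above:

pred_0 <- train %>%
  select(income) %>%
  bind_cols(
    predict(mod_log_0, new_data = train, type = "class")
  ) %>%
  rename(income_log_0 = .pred_class)

## overall accuracy of the null model on the training set:
pred_0 %>% accuracy(truth = income, estimate = income_log_0)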
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.759
With the null model, the proportion of times we are right is the proportion of subjects who are not high earners!
Note: accuracy()
comes from package yardstick, loaded with tidymodels.
Sad reality: the higher the sensitivity, the higher the false positive rate is liable to be!
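The confusion matrix below can be obtained with yardstick's conf_mat(); a sketch, using the pred_0 data frame from the earlier sketch:

pred_0 %>% conf_mat(truth = income, estimate = income_log_0)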
Truth
Prediction <=50K >50K
<=50K 19763 6285
>50K 0 0
Sensitivity: \(0/(0 + 6285) = 0\).
False positive rate is 0%, but the null model is completely insensitive!
Let’s use capital_gain as the predictor variable.
train_plus <- train %>%
  mutate(high_earner = as.integer(income == ">50K"))

ggplot(train_plus, aes(x = capital_gain, y = high_earner)) +
  ## jittered points, sized by how many observations they represent:
  geom_count(
    position = position_jitter(width = 0, height = 0.05),
    alpha = 0.5
  ) +
  ## fitted logistic curve:
  geom_smooth(
    method = "glm", method.args = list(family = "binomial"),
    color = "dodgerblue", lty = 2, se = FALSE
  ) +
  ## reference line at the 0.5 classification threshold:
  geom_hline(yintercept = 0.5, linetype = 3) +
  scale_x_log10(labels = scales::dollar)
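The one-predictor model referenced below as mod_log_1 can be fit in the same style as the full model later in this section; a minimal sketch:

mod_log_1 <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit(income ~ capital_gain, data = train)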
pred_1 <- train %>%
select(income) %>%
bind_cols(
predict(mod_log_1, new_data = train, type = "class")
) %>%
rename(income_log_1 = .pred_class)
pred_1
# A tibble: 26,048 × 2
income income_log_1
<fct> <fct>
1 <=50K <=50K
2 >50K <=50K
3 <=50K <=50K
4 <=50K <=50K
5 <=50K <=50K
6 <=50K <=50K
7 <=50K <=50K
8 <=50K <=50K
9 <=50K <=50K
10 <=50K <=50K
# … with 26,038 more rows
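The confusion matrix and accuracy shown below can be obtained from pred_1 with yardstick; a sketch:

pred_1 %>% conf_mat(truth = income, estimate = income_log_1)
pred_1 %>% accuracy(truth = income, estimate = income_log_1)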
Truth
Prediction <=50K >50K
<=50K 19560 5015
>50K 203 1270
Sensitivity: \(1270/(1270 + 5015) \approx 0.202\).
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.800
The model is a bit more accurate overall (0.800 versus 0.759 for the null model).
So far, when the model classifies an individual:
The threshold is set at 0.5: predict >50K when the estimated probability of >50K exceeds 0.5, and <=50K otherwise.
Other thresholds are possible.
income_probs <-
predict(mod_log_1, new_data = train, type = "prob")
income_probs %>%
group_by(rich = `.pred_>50K` > 0.25) %>%
count() %>%
mutate(pct = n / nrow(income_probs))
# A tibble: 2 × 3
# Groups: rich [2]
rich n pct
<lgl> <int> <dbl>
1 FALSE 23930 0.919
2 TRUE 2118 0.0813
With the lower threshold of 0.25, the model is more likely to classify an individual as a high earner (increasing sensitivity, but also increasing the false positive rate).
income_probs %>%
group_by(rich = `.pred_>50K` > 0.75) %>%
count() %>%
mutate(pct = n / nrow(income_probs))
# A tibble: 2 × 3
# Groups: rich [2]
rich n pct
<lgl> <int> <dbl>
1 FALSE 25123 0.964
2 TRUE 925 0.0355
With the higher threshold of 0.75, the model is less likely to classify an individual as a high earner (decreasing sensitivity, but also decreasing the false positive rate).
The Receiver-Operating Characteristic curve plots sensitivity against false positive rate, for all thresholds from 0 to 1.
roc_1 <-
train %>%
## we just need the true incomes from train:
select(income) %>%
## we need the predicted probability of >50K:
bind_cols(predict(mod_log_1, new_data = train, type = "prob")) %>%
## next, compute sensitivity and specificity at many thresholds:
roc_curve(truth = income, `.pred_>50K`, event_level = "second") %>%
## makes ROC curve (a ggplot2 object) from the above data frame:
autoplot()
## print out the plot:
roc_1
It may help the reader if you refine the plot:
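For example, you might add a title and clearer axis labels; a sketch (the wording is ours):

roc_1 +
  labs(
    title = "ROC curve: income ~ capital_gain, training set",
    x = "1 - specificity (false positive rate)",
    y = "sensitivity"
  )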
mod_log_all <- logistic_reg(mode = "classification") %>%
set_engine("glm") %>%
fit(
income ~ age + workclass + education + marital_status +
occupation + relationship + race + sex +
capital_gain + capital_loss + hours_per_week,
data = train
)
pred_all <- train %>% select(income) %>%
bind_cols(
predict(mod_log_all, new_data = train, type = "class")
) %>%
rename(income_log_all = .pred_class)
We divided the data into a training set and a test set, then built the model on the training set.
What’s the test set for?
Think of studying for an exam by working practice questions: you “fit” yourself to the practice questions, and the real exam measures how well that preparation generalizes.
The null model doesn’t “study” at all. It doesn’t do very well on the training set, or on new data.
Models with more predictors “study” harder: they can match the training set very closely, perhaps too closely.
The more predictors a model uses, the more we have to worry about overfitting.
You can try a model on the test set to estimate how well it will work in practice on new data.
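A sketch of how the test-set results below can be produced, using the full model and the test set from the earlier split:

pred_test <- test %>%
  select(income) %>%
  bind_cols(
    predict(mod_log_all, new_data = test, type = "class")
  ) %>%
  rename(income_log_all = .pred_class)

## confusion matrix and accuracy on the test set:
pred_test %>% conf_mat(truth = income, estimate = income_log_all)
pred_test %>% accuracy(truth = income, estimate = income_log_all)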
Truth
Prediction <=50K >50K
<=50K 4913 1239
>50K 44 317
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.803
This time, overfitting seems not to be a problem.
Another approach to estimating performance on new data is cross-validation: divide the data into \(k\) “folds”, and for each fold, fit the model on the other \(k-1\) folds and measure its performance on the held-out fold.
Then you have \(k\) measurements of possible performance on new data.
We’ll see cross-validation used in later Chapters.
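As a preview, a minimal sketch of \(k\)-fold cross-validation (here \(k = 5\)) with rsample and tune, both loaded with tidymodels; the object names are ours:

set.seed(123)
folds <- vfold_cv(train, v = 5)

cv_results <- logistic_reg(mode = "classification") %>%
  set_engine("glm") %>%
  fit_resamples(
    income ~ capital_gain,
    resamples = folds,
    metrics = metric_set(accuracy, roc_auc)
  )

## summarize the k per-fold estimates:
collect_metrics(cv_results)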