Decision Trees

(Textbook: Chapter 11)

The Big Picture

We are in the Model phase.

Stuff Needed

data(m111survey, package = "tigerstats")
data(verlander, package = "tigerstats")
library(tidyverse)
library(tree)
library(tigerTree)

Types of Learning

  • supervised (Chapters 10 and 11)
  • unsupervised (Chapter 12)

Supervised Learning

In supervised learning:

  • your data has a response variable;
  • the other variables are predictor variables;
  • your goal is to find a function (from predictor variables to response) that can be used to predict the value of the response variable for a new individual whose predictor values are known.

Unsupervised Learning

In unsupervised learning:

  • there is no clear response variable;
  • your goal is to find patterns or groupings in the data.

We will discuss supervised learning in this Chapter.

Stuff We Need

data(m111survey, package = "tigerstats")
data(verlander, package = "tigerstats")
library(tidyverse)
library(tree)
library(tigerTree)

Introduction to Decision Trees

Tree Models

Trees are a convenient way to:

  • describe how a set of explanatory variables is related to a response variable
  • predict values of the response variable for a new observation

Two types of trees:

  • classification trees (response is a factor)
  • regression trees (response is numerical)

In this course we will work with classification trees only. (Textbook uses the term “decision tree”.)

Packages for Tree-Making

  • The textbook uses package rpart.
  • To introduce the idea of trees we will use the older tree package.
  • Then we will teach the rpart way, with tidymodels, as the text does.
  • You’ll do HW using the textbook approach.

Our First Classification Tree

Predict sex from:

  • height
  • fastest

trSex <- tree(
  sex ~ fastest + height,
  data = m111survey
)
plot(trSex)
text(trSex)

Nodes

A tree has nodes:

  • the root node is at the top
  • each node can be split into two child nodes, based on the value of a variable
  • nodes that don’t get split are called terminal nodes

The majority sex in each terminal node is listed under the node.

The Tree Model (more detail)

trSex
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 71 97.280 female ( 0.56338 0.43662 )  
   2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )  
     4) height < 67.375 29  8.700 female ( 0.96552 0.03448 )  
       8) fastest < 122.5 24  0.000 female ( 1.00000 0.00000 ) *
       9) fastest > 122.5 5  5.004 female ( 0.80000 0.20000 ) *
     5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )  
      10) fastest < 97.5 6  5.407 female ( 0.83333 0.16667 ) *
      11) fastest > 97.5 8 11.090 male ( 0.50000 0.50000 ) *
   3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )  
     6) fastest < 106.5 13 14.050 male ( 0.23077 0.76923 ) *
     7) fastest > 106.5 15  0.000 male ( 0.00000 1.00000 ) *

How Does This Work?

How does the tree decide to make its splits?

How does it decide when to stop splitting?

By Tree-Control!

Under the hood, tree uses the control argument and the tree.control function:

trSex <- tree(
  sex ~ fastest + height, 
  data = m111survey,
  control = tree.control(
                nobs = nrow(m111survey),
                mincut = 5,
                minsize = 10,
                mindev = 0.01
            )
)

nobs

This is the number of observations you are using to build your tree.

We are using all the rows in m111survey.

So nobs = nrow(m111survey)

minsize

This is the smallest-sized node that you will consider splitting up into two child nodes.

By default, minsize = 10, but we could change that.

mincut

If you plan to split a node, each of the two child nodes must be at least mincut in size.

By default, mincut = 5, but you can change this.

Note: mincut must not be more than half of minsize (so with the default minsize = 10, mincut can be at most 5).

mindev

mindev pays attention to the deviance.

But what is deviance?

Deviance

Each node has a deviance. When you are predicting sex, the deviance formula is:

\[-2 \left(n_{\text{female}} \ln(p_{\text{female}}) + n_{\text{male}} \ln(p_{\text{male}}) \right),\]

where:

  • \(n_{\text{female}} =\) number of females in the node;
  • \(p_{\text{female}} =\) proportion of females in the node;
  • \(n_{\text{male}} =\) number of males in the node;
  • \(p_{\text{male}} =\) proportion of males in the node;

\(\ln\) is the natural logarithm function.

Deviance

For the root node of the tree the deviance is:

\[ -2\left(40\ln(40/71) + 31\ln(31/71)\right) \approx 97.280.\]

Deviance is a measure of how “pure” a node is. The closer the node is to being all male or all female, the closer its deviance is to 0.
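
Just to check the arithmetic, here is a quick sketch of that root-node computation in R:

n_f <- 40; n_m <- 31; n <- n_f + n_m    # counts at the root node
p_f <- n_f / n; p_m <- n_m / n          # proportions at the root node
-2 * (n_f * log(p_f) + n_m * log(p_m))  # about 97.28, as reported for the root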

Rewriting Deviance

Consider the following function:

\[f(p) = -2\left(p\ln(p) + (1-p)\ln(1-p)\right),\] where \(0 < p < 1\).

Graph of f

Notice:

  • \(f\) is smallest for \(p\) near 0 and near 1
  • \(f\) is biggest at \(p = 1/2\)
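
The plot itself isn't reproduced in these notes, so here is a minimal sketch for drawing the graph of \(f\) yourself:

f <- function(p) -2 * (p * log(p) + (1 - p) * log(1 - p))
curve(f, from = 0.001, to = 0.999,
      xlab = "p", ylab = "f(p)")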

Some Algebra:

Let \(n\) be the number of items at the node, and consider:

\[ \frac{\text{deviance}}{n} = -2 \left(\frac{n_{\text{female}}}{n} \ln(p_{\text{female}}) + \frac{n_{\text{male}}}{n} \ln(p_{\text{male}})\right) \]

This is

\[-2 \left(p_{\text{female}} \ln(p_{\text{female}}) + p_{\text{male}} \ln(p_{\text{male}})\right),\]

which, since \(p_{\text{male}} = 1 - p_{\text{female}}\), is

\[-2 \left(p_{\text{female}} \ln(p_{\text{female}}) + (1-p_{\text{female}}) \ln(1-p_{\text{female}})\right),\]

and this is \(f(p_{\text{female}})\).

So:

\[\text{deviance of the node} = n \times f(p_{\text{female}}).\]
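
As a quick numerical check of this identity, here is a sketch using the root node of trSex:

f <- function(p) -2 * (p * log(p) + (1 - p) * log(1 - p))
71 * f(40/71)   # about 97.28, the deviance reported for the root node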

Accordingly …

  • Deviance is biggest when the node is “most impure” (\(p_{\text{female}}\) and \(p_{\text{male}}\) both near \(1/2\).)
  • Deviance is smallest when the node is maximally pure (\(p_{\text{female}} = 0\) and \(p_{\text{male}} = 1\), or vice versa).

Deviance in Splitting

The idea in splitting a node is to choose a split so that:

the sum of the deviances of the two child nodes is as small as possible.

(For example, the first split of trSex produced child nodes with deviances 34.75 and 19.07, a total of about 53.8, well below the root's 97.28.)

mindev in Splitting

mindev sets a limit on this. To split a node into two child nodes, the deviance of that node must be at least:

\[\text{mindev} \times \text{root deviance}.\]

By default, mindev = 0.01, but you can change this. (For trSex, with a root deviance of about 97.28, a node must therefore have deviance of at least about 0.97 in order to be split.)

A Walk-Through

Let’s examine the construction of our trSex model, step-by-step.

Beginning (no splits yet)

1) root 71 97.280 female ( 0.56338 0.43662 ) 

First split: the root splits on height at 69.5.

1) root 71 97.280 female ( 0.56338 0.43662 )  
  2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )
  3) height > 69.5 28 19.070 male ( 0.10714 0.89286 ) 

Then node 3 splits on fastest at 106.5; both children are terminal nodes.

3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )  
  6) fastest < 106.5 13 14.050 male ( 0.23077 0.76923 ) *
  7) fastest > 106.5 15  0.000 male ( 0.00000 1.00000 ) *

Back to node 2, which splits on height at 67.375.

2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )  
     4) height < 67.375 29  8.700 female ( 0.96552 0.03448 )  
     5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )

Node 5 splits on fastest at 97.5; both children are terminal nodes.

5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )  
  10) fastest < 97.5 6  5.407 female ( 0.83333 0.16667 ) *
  11) fastest > 97.5 8 11.090 male ( 0.50000 0.50000 ) *

Finally, node 4 splits on fastest at 122.5; both children are terminal nodes.

4) height < 67.375 29  8.700 female ( 0.96552 0.03448 )  
    8) fastest < 122.5 24  0.000 female ( 1.00000 0.00000 ) *
    9) fastest > 122.5 5  5.004 female ( 0.80000 0.20000 ) *

More Predictor Variables

Trees can work with any number of explanatory (predictor) variables. Each predictor must be a factor or a numerical variable.

Could we predict sex better if we used more predictor variables?

Let’s use all the other variables in m111survey!

Use All Other Variables

trSex2 <- tree(sex ~ ., data = m111survey)
plot(trSex2); text(trSex2)

Note that we are sticking with the default values of control.

The Result

A Useful Summary

summary(trSex2)

Classification tree:
tree(formula = sex ~ ., data = m111survey)
Variables actually used in tree construction:
[1] "ideal_ht" "GPA"      "fastest" 
Number of terminal nodes:  4 
Residual mean deviance:  0.1975 = 12.64 / 64 
Misclassification error rate: 0.04412 = 3 / 68 

The Output

Lead-up info:

Classification tree:
tree(formula = sex ~ ., data = m111survey)
Variables actually used in tree construction:
[1] "ideal_ht" "GPA"      "fastest" 
Number of terminal nodes:  4 

Output: Deviance

Residual mean deviance:  0.1975 = 12.64 / 64 

R added up the deviances at the four terminal nodes (12.64 in all), then divided by:

\[\text{observations} - \text{terminal nodes} = 68 - 4 = 64.\]

The smaller the residual mean deviance, the more “pure” the terminal nodes are, on average.

Output: Error Rate

At each terminal node, you predict sex based on which sex is the majority at the node.

The minority observations are “mis-classed.”

Mis-Classed Observations

You can find the three mis-classed observations below:

trSex2
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 68 93.320 female ( 0.55882 0.44118 )  
  2) ideal_ht < 71 39 15.780 female ( 0.94872 0.05128 )  
    4) GPA < 2.785 6  7.638 female ( 0.66667 0.33333 ) *
    5) GPA > 2.785 33  0.000 female ( 1.00000 0.00000 ) *
  3) ideal_ht > 71 29  8.700 male ( 0.03448 0.96552 )  
    6) fastest < 92.5 5  5.004 male ( 0.20000 0.80000 ) *
    7) fastest > 92.5 24  0.000 male ( 0.00000 1.00000 ) *

Missing Observations

Wait a minute! The error rate was “3 out of 68”. But m111survey has 71 observations:

nrow(m111survey)
[1] 71

It turns out that three observations were missing values for predictor variables, so R could not use them to check the tree. This will happen frequently.
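
If you want to check the 3/68 yourself, here is a rough sketch (assuming, as above, that the three incomplete rows are the only ones with missing values):

m111_complete <- na.omit(m111survey)                 # should leave the 68 complete rows
pred <- predict(trSex2, newdata = m111_complete, type = "class")
table(predicted = pred, actual = m111_complete$sex)  # confusion table
mean(pred != m111_complete$sex)                      # mis-classification rate, about 3/68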

Caution About the Error Rate

The 3/68 sounds great, but this is how the tree did on the same data that was used to build it.

If you used the tree to predict new observations of a similar sort, it probably would not do as well, especially if you made a tree with lots of tiny, pure terminal nodes!

Control the Size of Your Tree

Let’s allow the tree to grow “big” (i.e., have as many terminal nodes as possible).

trSex3 <- tree(sex ~ ., data = m111survey,
               control = tree.control(
                 nobs = nrow(m111survey),
                 mincut = 1,
                 minsize = 2,
                 mindev = 0
               ))
plot(trSex3); text(trSex3)

The Model

The Summary

summary(trSex3)

Classification tree:
tree(formula = sex ~ ., data = m111survey, control = tree.control(nobs = nrow(m111survey), 
    mincut = 1, minsize = 2, mindev = 0))
Variables actually used in tree construction:
[1] "ideal_ht" "GPA"      "height"   "sleep"    "fastest" 
Number of terminal nodes:  6 
Residual mean deviance:  0 = 0 / 62 
Misclassification error rate: 0 = 0 / 68 

All terminal nodes are pure!

(But this tree is “overgrown”. We would not expect it to do a great job on new data.)

Dealing with Factor Predictors

seat didn’t make it into the “big” tree model. Is it relevant to sex at all?

trSex4 <- tree(sex ~ seat, data = m111survey)
plot(trSex4); text(trSex4)

What does this graph mean??

Factors in a Tree Plot:

In the plot, levels of a factor are coded by letters: a,b,c, …

trSex4
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 71 97.28 female ( 0.5634 0.4366 )  
  2) seat: 1_front 27 32.82 female ( 0.7037 0.2963 ) *
  3) seat: 2_middle,3_back 44 60.91 male ( 0.4773 0.5227 ) *

seat and sex

seat is not much related to sex. Look at the high mis-classification rate:

summary(trSex4)
Classification tree:
tree(formula = sex ~ seat, data = m111survey)
Number of terminal nodes:  2 
Residual mean deviance:  1.358 = 93.72 / 69 
Misclassification error rate: 0.4085 = 29 / 71 

Practice

data(iris)
View(iris)
help(iris)

Use the iris data to make a tree model that predicts Species from Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.

How many terminal nodes does it have?

What is the mis-classification rate?
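
A possible starting point (just a sketch; the questions themselves are left to you):

trIris <- tree(Species ~ ., data = iris)
plot(trIris); text(trIris)
summary(trIris)   # look here for the number of terminal nodes and the error rate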

Prediction with Classification Trees

Thinking about Error Rates and Trees Sizes

We said that the mis-classification rate can underestimate the error a tree model will have when applied to new observations.

We also have options about “how large” to grow our tree.

  • What’s the “best” tree size?
  • How can we properly estimate the error rate for new observations?

Let’s answer these questions with an example.

Justin Verlander

data("verlander", package = "tigerstats")
View(verlander)
?tigerstats::verlander

The Pitch-fx machine classifies pitches by type.

Research Question:

Can we predict how the machine will classify a pitch? What variables are important in determining pitch-type?
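
A natural first step is to count how often the machine assigned each pitch type. A quick sketch:

verlander %>% count(pitch_type)   # number of pitches of each type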

First, Look Around a Bit

verlander %>%
  ggplot(aes(x = speed)) +
  geom_density()

Look with Box Plots

verlander %>%
  ggplot(aes(x = pitch_type, y = speed)) +
  geom_boxplot()

Remove an Unusual Observation

Verlander threw one very slow pitch (probably an intentional ball). Let’s get rid of it:

ver2 <- verlander %>% filter(speed > 60)

We’ll work with the ver2 data frame from now on.

More Looking

Try various other plots (see the Addins menu for cloud plots), etc.

Unimportant Variables

gamedate probably doesn’t matter. Anyway, it’s neither a factor nor a numerical variable, so tree models can’t use it. Let’s remove it:

ver2 <-
  ver2 %>% 
  select(-gamedate)
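
If you want to confirm that every remaining column is something a tree can use, here is a quick check (just a sketch):

sapply(ver2, class)   # each column should now be a factor or a numerical variable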

A Classification Tree

trMod <- tree(
  pitch_type ~ ., 
  data = ver2
)
plot(trMod)
text(trMod)

A Closer Look

trMod
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 15306 44070.0 FF ( 0.1666013 0.1773814 0.4413955 0.1320397 0.0825820 )  
   2) speed < 91.75 6565 14350.0 CU ( 0.3859863 0.4135567 0.0004570 0.0085301 0.1914699 )  
     4) pfx_x < -2.025 2649  1152.0 CH ( 0.9554549 0.0000000 0.0011325 0.0211401 0.0222726 ) *
     5) pfx_x > -2.025 3916  4870.0 CU ( 0.0007661 0.6933095 0.0000000 0.0000000 0.3059244 )  
      10) pfx_z < -2.365 2755   866.1 CU ( 0.0000000 0.9633394 0.0000000 0.0000000 0.0366606 ) *
      11) pfx_z > -2.365 1161   519.6 SL ( 0.0025840 0.0525409 0.0000000 0.0000000 0.9448751 ) *
   3) speed > 91.75 8741  9652.0 FF ( 0.0018305 0.0000000 0.7725661 0.2248027 0.0008008 )  
     6) pfx_x < -8.305 2063  1075.0 FT ( 0.0000000 0.0000000 0.0727096 0.9272904 0.0000000 )  
      12) pfx_x < -8.525 1726   112.6 FT ( 0.0000000 0.0000000 0.0052144 0.9947856 0.0000000 ) *
      13) pfx_x > -8.525 337   458.2 FT ( 0.0000000 0.0000000 0.4183976 0.5816024 0.0000000 ) *
     7) pfx_x > -8.305 6678   943.2 FF ( 0.0023959 0.0000000 0.9887691 0.0077868 0.0010482 ) *

A Summary of the Tree

summary(trMod)

Classification tree:
tree(formula = pitch_type ~ ., data = ver2)
Variables actually used in tree construction:
[1] "speed" "pfx_x" "pfx_z"
Number of terminal nodes:  6 
Residual mean deviance:  0.2648 = 4052 / 15300 
Misclassification error rate: 0.03319 = 508 / 15306 

Graphical Look

Using the Cloud Addin, we can get something like this:

lattice::cloud(speed ~ pfx_x * pfx_z,
    data = ver2,
    screen = list(x = -90,
            y = 8,
            z = 0),
    groups = pitch_type,
    auto.key = list(
        space = "top",
        title = "pitch_type",
        cex.title = 1,
        columns = 3),
    zoom = 0.75,
    par.settings = latticeExtra::custom.theme(
        symbol = viridis::viridis(5),
         bg = "gray90", fg = "gray20", pch = 19
    ))

An Overgrown Tree

A tree with many terminal nodes:

trMod2 <- tree(
  pitch_type ~ .,
  data = ver2,
  control = tree.control(
    nobs = nrow(ver2), 
    mincut = 1, 
    minsize = 2, 
    mindev = 0
  )
)
plot(trMod2)
text(trMod2)

Yuck!

What’s Going On?

summary(trMod2)

Classification tree:
tree(formula = pitch_type ~ ., data = ver2, control = tree.control(nobs = nrow(ver2), 
    mincut = 1, minsize = 2, mindev = 0))
Number of terminal nodes:  261 
Residual mean deviance:  0 = 0 / 15040 
Misclassification error rate: 0 = 0 / 15306 

The tree kept growing, making more and more splits until no further helpful splits could be made.

Which is Better …

… a tree with lots of terminal nodes, or fewer terminal nodes?

And can we estimate how well our tree will do on new data?

Training, Quiz, Test

We’ll subdivide our data into three groups:

  • training set
  • quiz set
  • test set

Then …

  • We’ll build as many trees as we like, using only the training set.
  • We’ll try each tree model on the quiz set, and note the error rate.
  • We’ll pick the model with the lowest error rate (or one that we like for other good reasons).
  • We will try that model, and ONLY that model, on the test set.
  • The error rate on the test set is our best estimate of the error rate for new data.

Subdivision

Let’s say we’ll assign:

  • 60% of the data to the training set;
  • 20% to the quiz set;
  • 20% to the test set.

This assignment should be done randomly!
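
Here is a rough base-R sketch of one way to make such a random split (the variable names are just for illustration; the tigerTree function on the next slide will do this for us):

set.seed(3030)                      # for reproducibility
n <- nrow(ver2)
grp <- sample(c("train", "quiz", "test"), size = n,
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
myTrain <- ver2[grp == "train", ]   # about 60% of the rows
myQuiz  <- ver2[grp == "quiz", ]    # about 20%
myTest  <- ver2[grp == "test", ]    # about 20%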

Subdivision

The package tigerTree contains a function that will divide a data frame randomly into the desired groups.

library(tigerTree)
dfs <- divideTrainTest(
  seed = 3030, 
  prop.train = 0.6,
  prop.quiz = 0.2,
  data = ver2
)
verTrain <- dfs$train
verQuiz <- dfs$quiz
verTest <- dfs$test
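
As a quick sanity check, you can look at the size of each piece:

nrow(verTrain)   # roughly 60% of nrow(ver2)
nrow(verQuiz)    # roughly 20%
nrow(verTest)    # roughly 20%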

Build Trees with Training Set:

First, a “conservative” tree:

tr_small <- tree(
  pitch_type ~ ., 
  data = verTrain
)
summary(tr_small)

Classification tree:
tree(formula = pitch_type ~ ., data = verTrain)
Variables actually used in tree construction:
[1] "speed" "pfx_x" "pfx_z"
Number of terminal nodes:  6 
Residual mean deviance:  0.2488 = 2283 / 9177 
Misclassification error rate: 0.0318 = 292 / 9183 

Build Trees with Training Set

Next, an “overgrown” tree:

tr_big <- tree(
  pitch_type ~ ., 
  data = verTrain,
  control = tree.control(
    nobs = nrow(verTrain), 
    mincut = 1, minsize = 2, mindev = 0)
)

Take a Look …

summary(tr_big)

Classification tree:
tree(formula = pitch_type ~ ., data = verTrain, control = tree.control(nobs = nrow(verTrain), 
    mincut = 1, minsize = 2, mindev = 0))
Number of terminal nodes:  157 
Residual mean deviance:  0 = 0 / 9026 
Misclassification error rate: 0 = 0 / 9183 

No errors (as you would expect when all the nodes are pure).

Try Small Tree on the Quiz Set:

This is done with the tryTree function from the tigerTree package:

tryTree(
  mod = tr_small, 
  testSet = verQuiz, 
  truth = verQuiz$pitch_type
)
Residual mean deviance:  0.271 = 827.9 / 3055
Misclassification error rate:  0.03398 = 104 / 3061
Confusion matrix:
          truth
prediction   CH   CU   FF   FT   SL
        CH  506    0    1   11   10
        CU    0  544    0    0   23
        FF    3    0 1314    9    2
        FT    0    0   31  367    0
        SL    0   14    0    0  226

Error rate bigger than on the training set! (The off-diagonal entries of the confusion matrix are the mis-classed pitches: for example, 31 pitches whose true type is FF were predicted to be FT.)

Try Big Tree on the Quiz Set

Again use tryTree but with the tr_big model:

tryTree(
  mod = tr_big, 
  testSet = verQuiz, 
  truth = verQuiz$pitch_type
)
Residual mean deviance:  0.0987 = 292.2 / 2961
Misclassification error rate:  0.02777 = 85 / 3061
Confusion matrix:
          truth
prediction   CH   CU   FF   FT   SL
        CH  498    0    1    3    9
        CU    0  553    0    0   12
        FF    2    0 1322   21    0
        FT    2    0   22  363    0
        SL    7    5    1    0  240

Error rate also bigger than on the training set!

What’s the best Tree Size?

There are more systematic ways to decide, but we’ll just tune the tree by hand with this local Shiny app:

tuneTree(
  pitch_type ~ .,
  data = verTrain,
  testSet = verQuiz,
  truth = verQuiz$pitch_type
)

My Choice

I finally decided to go with:

tr_chosen <- tree(
  pitch_type ~ .,
  data = verTrain,
  control = tree.control(
    nobs = nrow(verTrain), 
    mincut = 7,
    minsize = 14,
    mindev = 0.0003)
)

A summary of this tree (abbreviated):

Variables actually used in tree construction:
[1] "speed"   "pfx_x"   "pz"      "season"  "pfx_z"   "px"      "pitches"
Number of terminal nodes:  41 
Residual mean deviance:  0.07468 = 682.7 / 9142 
Misclassification error rate: 0.01568 = 144 / 9183 

On the Quiz Set …

… my choice gave:

tryTree(tr_chosen, testSet = verQuiz, truth = verQuiz$pitch_type)
Residual mean deviance:  0.1189 = 358.5 / 3015
Misclassification error rate:  0.02516 = 77 / 3061
Confusion matrix:
          truth
prediction   CH   CU   FF   FT   SL
        CH  503    0    1    0    9
        CU    0  553    0    0   13
        FF    3    0 1327   23    2
        FT    2    0   18  364    0
        SL    1    5    0    0  237

On the Test Set, I Got …

tryTree(tr_chosen, testSet = verTest, truth = verTest$pitch_type)
Residual mean deviance:  0.1259 = 379.6 / 3015
Misclassification error rate:  0.02417 = 74 / 3062
Confusion matrix:
          truth
prediction   CH   CU   FF   FT   SL
        CH  539    0    1    1   14
        CU    0  558    0    0    9
        FF    5    0 1309   14    2
        FT    2    0   18  353    0
        SL    1    7    0    0  229

Note

Our tree does pretty well at imitating Pitch-fx, but it’s not exactly how Pitch-fx makes its classifications:

  • Pitch-fx may use other predictors besides the ones in verlander
  • It probably uses more sophisticated prediction methods