(Textbook: Chapter 11)
We are in the Model phase.
In supervised learning: there is a particular response variable we want to predict, and we build a model for it from explanatory (predictor) variables.
In unsupervised learning: there is no designated response variable; we simply look for interesting structure in the data.
We will discuss supervised learning in this Chapter.
Trees are a convenient way to: model the relationship between a response variable and one or more explanatory variables, and to see which predictors matter most.
Two types of trees: regression trees (numerical response variable) and classification trees (factor response variable).
In this course we will work with classification trees only. (Textbook uses the term “decision tree”.)
A tree has nodes: a root node at the top, interior nodes that are split further, and terminal nodes (leaves), where splitting stops and a prediction is made.

The majority sex in each terminal node is listed under the node.
node), split, n, deviance, yval, (yprob)
* denotes terminal node
1) root 71 97.280 female ( 0.56338 0.43662 )
2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )
4) height < 67.375 29 8.700 female ( 0.96552 0.03448 )
8) fastest < 122.5 24 0.000 female ( 1.00000 0.00000 ) *
9) fastest > 122.5 5 5.004 female ( 0.80000 0.20000 ) *
5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )
10) fastest < 97.5 6 5.407 female ( 0.83333 0.16667 ) *
11) fastest > 97.5 8 11.090 male ( 0.50000 0.50000 ) *
3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )
6) fastest < 106.5 13 14.050 male ( 0.23077 0.76923 ) *
7) fastest > 106.5 15 0.000 male ( 0.00000 1.00000 ) *
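The printout above presumably came from a call along the following lines (a sketch, not code from the notes; the model name trSex is taken from later in this chapter):

library(tree)
library(tigerstats)                # contains the m111survey data frame
trSex <- tree(sex ~ height + fastest, data = m111survey)
print(trSex)                       # the node-by-node printout shown above
plot(trSex); text(trSex)           # picture of the tree, with the majority sex at each leaf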
How does the tree decide to make its splits?
How does it decide when to stop splitting?
Under the hood, tree uses the control argument and the tree.control function:
nobs is the number of observations you are using to build your tree.
We are using all the rows in m111survey.
So nobs = nrow(m111survey)
minsize is the smallest-sized node that you will consider splitting up into two child nodes.
By default, minsize = 10, but we could change that.
If you plan to split a node, each of the two child nodes must be at least mincut in size.
By default, mincut = 5, but you can change this.
Note: mincut must not be more than half of minsize.
mindev pays attention to the deviance.
But what is deviance?
Each node has a deviance. When you are predicting sex, the deviance formula is:
\[-2 \left(n_{\text{female}} \ln(p_{\text{female}}) + n_{\text{male}} \ln(p_{\text{male}}) \right),\]
where:
\(n_{\text{female}}\) and \(n_{\text{male}}\) are the numbers of females and males at the node,
\(p_{\text{female}}\) and \(p_{\text{male}}\) are the proportions of females and males at the node, and
\(\ln\) is the natural logarithm function.
For the root node of the tree the deviance is:
\[ -2\left(40\ln(40/71) + 31\ln(31/71)\right) \approx 97.280.\]
Deviance is a measure of how “pure” a node is. The closer the node is to being all male or all female, the closer its deviance is to 0.
Consider the following function:
\[f(p) = -2\left(p\ln(p) + (1-p)\ln(1-p)\right),\] where \(0 < p < 1\).
Let \(n\) be the number of items at the node, and consider:
\[ \frac{\text{deviance}}{n} = -2 \left(\frac{n_{\text{female}}}{n} \ln(p_{\text{female}}) + \frac{n_{\text{male}}}{n} \ln(p_{\text{male}})\right) \]
This is
\[-2 \left(p_{\text{female}} \ln(p_{\text{female}}) + p_{\text{male}} \ln(p_{\text{male}})\right),\]
and since \(p_{\text{male}} = 1 - p_{\text{female}}\), this is
\[-2 \left(p_{\text{female}} \ln(p_{\text{female}}) + (1-p_{\text{female}}) \ln(1-p_{\text{female}})\right),\]
which is \(f(p_{\text{female}})\).
So:
\[\text{deviance of the node} = n \times f(p_{female}).\]
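A quick numerical check in R (a sketch; the counts 40 and 31 are the female and male counts at the root node, as used above):

-2 * (40 * log(40/71) + 31 * log(31/71))     # about 97.28, matching the root deviance
f <- function(p) -2 * (p * log(p) + (1 - p) * log(1 - p))
71 * f(40/71)                                # the same value, via n * f(p_female)
curve(f, from = 0.001, to = 0.999)           # f(p) is near 0 when p is near 0 or 1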
The idea in splitting a node is to choose a split so that:
the sum of the deviances of the two child nodes is as small as possible.
mindev sets a limit on this. To split a node into two child nodes, the deviance of that node must be at least:
\[\text{mindev} \times \text{root deviance}.\]
By default, mindev = 0.01, but you can change this.
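Putting the control settings together, a call with non-default values might look like this (a sketch; the values shown are only illustrative):

trSexLoose <- tree(
  sex ~ height + fastest,
  data = m111survey,
  control = tree.control(
    nobs    = nrow(m111survey),   # observations used to build the tree
    minsize = 10,                 # only nodes at least this big may be split (default 10)
    mincut  = 5,                  # each child node must have at least this many rows (default 5)
    mindev  = 0.005               # a node's deviance must be at least mindev * root deviance to be split
  )
)
summary(trSexLoose)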
Let’s examine the construction of our trSex model, step-by-step.
1) root 71 97.280 female ( 0.56338 0.43662 )
1) root 71 97.280 female ( 0.56338 0.43662 )
2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )
3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )
3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )
6) fastest < 106.5 13 14.050 male ( 0.23077 0.76923 ) *
7) fastest > 106.5 15 0.000 male ( 0.00000 1.00000 ) *
2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )
4) height < 67.375 29 8.700 female ( 0.96552 0.03448 )
5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )
5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )
10) fastest < 97.5 6 5.407 female ( 0.83333 0.16667 ) *
11) fastest > 97.5 8 11.090 male ( 0.50000 0.50000 ) *
4) height < 67.375 29 8.700 female ( 0.96552 0.03448 )
8) fastest < 122.5 24 0.000 female ( 1.00000 0.00000 ) *
9) fastest > 122.5 5 5.004 female ( 0.80000 0.20000 ) *
Trees can work with any number of explanatory (predictor) variables. Each predictor must be a factor or a numerical variable.
Could we predict sex better if we used more predictor variables?
Let’s use all the other variables in m111survey!
Note that we are sticking with the default values of control.
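Presumably the new model is fit along these lines (a sketch; trSexAll is just a placeholder name):

trSexAll <- tree(sex ~ ., data = m111survey)
summary(trSexAll)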
Summary information for the new model:
Classification tree:
tree(formula = sex ~ ., data = m111survey)
Variables actually used in tree construction:
[1] "ideal_ht" "GPA" "fastest"
Number of terminal nodes: 4
Residual mean deviance: 0.1975 = 12.64 / 64
Misclassification error rate: 0.04412 = 3 / 68
R added the deviance at all four terminal nodes, then divided by:
\[\text{observations} - \text{terminal nodes} = 68 - 4 = 64.\]
The smaller the residual mean deviance, the more “pure” the terminal nodes are, on average.
At each terminal node, you predict sex based on which sex is the majority at the node.
The minority observations are “mis-classed.”
You can find the three mis-classed observations below:
node), split, n, deviance, yval, (yprob)
* denotes terminal node
1) root 68 93.320 female ( 0.55882 0.44118 )
2) ideal_ht < 71 39 15.780 female ( 0.94872 0.05128 )
4) GPA < 2.785 6 7.638 female ( 0.66667 0.33333 ) *
5) GPA > 2.785 33 0.000 female ( 1.00000 0.00000 ) *
3) ideal_ht > 71 29 8.700 male ( 0.03448 0.96552 )
6) fastest < 92.5 5 5.004 male ( 0.20000 0.80000 ) *
7) fastest > 92.5 24 0.000 male ( 0.00000 1.00000 ) *
Wait a minute! The error rate was “3 out of 68”. But m111survey has 71 observations:
It turns out that three observations had missing values for some of the predictor variables, so R could not use them in building the tree. This will happen frequently.
The 3/68 sounds great, but this is how the tree did on the same data that was used to build it.
If you used the tree to predict new observations of a similar sort, it probably would not do as well, especially if you made a tree with lots of tiny, pure terminal nodes!
Let’s allow the tree to grow “big” (i.e., have as many terminal nodes as possible).
Classification tree:
tree(formula = sex ~ ., data = m111survey, control = tree.control(nobs = nrow(m111survey),
mincut = 1, minsize = 2, mindev = 0))
Variables actually used in tree construction:
[1] "ideal_ht" "GPA" "height" "sleep" "fastest"
Number of terminal nodes: 6
Residual mean deviance: 0 = 0 / 62
Misclassification error rate: 0 = 0 / 68
All terminal nodes are pure!
(But this tree is “overgrown”. We would not expect it to do a great job on new data.)
seat didn’t make it into the “big” tree model. Is it relevant to sex at all?
What does this graph mean??
In the plot, levels of a factor are coded by letters: a,b,c, …
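The graph in question is presumably a plotted tree that uses seat as the predictor. A sketch of how such a plot could be made (trSexSeat is a name chosen for illustration):

trSexSeat <- tree(sex ~ seat, data = m111survey)
plot(trSexSeat)
text(trSexSeat)    # a split on a factor is labeled with letters (a, b, c, ...) for its levels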
seat is not much related to sex. Look at the high mis-classification rate:
Classification tree:
tree(formula = sex ~ seat, data = m111survey)
Number of terminal nodes: 2
Residual mean deviance: 1.358 = 93.72 / 69
Misclassification error rate: 0.4085 = 29 / 71
Use the iris data to make a tree model that predicts Species from Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.
How many terminal nodes does it have?
What is the mis-classification rate?
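A possible start on this practice exercise (a sketch; trIris is just a placeholder name):

trIris <- tree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = iris)
summary(trIris)    # reports the number of terminal nodes and the misclassification rate
plot(trIris); text(trIris)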
We said that the mis-classification rate can underestimate the error a tree model will have when applied to new observations.
We also have options about “how large” to grow our tree.
Let’s answer these questions with an example.
The Pitch-fx machine classifies pitches by type.
Research Question:
Can we predict how the machine will classify a pitch? What variables are important in determining pitch-type?
Verlander had one very slow pitch (probably an intentional ball). Let’s get rid of it:
We’ll work with the ver2 data frame from now on.
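One way to remove that pitch, assuming the original data frame is called verlander (the data set named at the end of these notes):

slowest <- which.min(verlander$speed)   # row of the single very slow pitch
ver2 <- verlander[-slowest, ]           # drop it; ver2 is what we use from here on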
Try various plots (see Addins for cloud plots), etc.
gamedate probably doesn’t matter. Anyway, it’s neither a factor nor numerical, so the trees won’t work with it. Let’s remove it:
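A sketch of dropping gamedate and fitting a first tree for pitch_type (trPitch is a placeholder name; the printout below presumably comes from a model fit this way, with the default control settings):

ver2$gamedate <- NULL                         # remove the gamedate column
trPitch <- tree(pitch_type ~ ., data = ver2)  # default control settings
print(trPitch)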
node), split, n, deviance, yval, (yprob)
* denotes terminal node
1) root 15306 44070.0 FF ( 0.1666013 0.1773814 0.4413955 0.1320397 0.0825820 )
2) speed < 91.75 6565 14350.0 CU ( 0.3859863 0.4135567 0.0004570 0.0085301 0.1914699 )
4) pfx_x < -2.025 2649 1152.0 CH ( 0.9554549 0.0000000 0.0011325 0.0211401 0.0222726 ) *
5) pfx_x > -2.025 3916 4870.0 CU ( 0.0007661 0.6933095 0.0000000 0.0000000 0.3059244 )
10) pfx_z < -2.365 2755 866.1 CU ( 0.0000000 0.9633394 0.0000000 0.0000000 0.0366606 ) *
11) pfx_z > -2.365 1161 519.6 SL ( 0.0025840 0.0525409 0.0000000 0.0000000 0.9448751 ) *
3) speed > 91.75 8741 9652.0 FF ( 0.0018305 0.0000000 0.7725661 0.2248027 0.0008008 )
6) pfx_x < -8.305 2063 1075.0 FT ( 0.0000000 0.0000000 0.0727096 0.9272904 0.0000000 )
12) pfx_x < -8.525 1726 112.6 FT ( 0.0000000 0.0000000 0.0052144 0.9947856 0.0000000 ) *
13) pfx_x > -8.525 337 458.2 FT ( 0.0000000 0.0000000 0.4183976 0.5816024 0.0000000 ) *
7) pfx_x > -8.305 6678 943.2 FF ( 0.0023959 0.0000000 0.9887691 0.0077868 0.0010482 ) *
Using the Cloud Addin, we can get something like this:
lattice::cloud(speed ~ pfx_x * pfx_z,
data = ver2,
screen = list(x = -90,
y = 8,
z = 0),
groups = pitch_type,
auto.key = list(
space = "top",
title = "pitch_type",
cex.title = 1,
columns = 3),
zoom = 0.75,
par.settings = latticeExtra::custom.theme(
symbol = viridis::viridis(5),
bg = "gray90", fg = "gray20", pch = 19
))
A tree with many terminal nodes:
Classification tree:
tree(formula = pitch_type ~ ., data = ver2, control = tree.control(nobs = nrow(ver2),
mincut = 1, minsize = 2, mindev = 0))
Number of terminal nodes: 261
Residual mean deviance: 0 = 0 / 15040
Misclassification error rate: 0 = 0 / 15306
The tree kept growing, making more and more splits until no further helpful splits could be made.
Which should we expect to do better on new data: a tree with lots of terminal nodes, or a tree with fewer terminal nodes?
And can we estimate how well our tree will do on new data?
We’ll subdivide our data into three groups: a training set (used to build candidate trees), a quiz set (used to compare candidate trees), and a test set (used to estimate how well the final tree will do on new data).
Let’s say we’ll assign: 60% of the rows to the training set, 20% to the quiz set, and 20% to the test set.
This assignment should be done randomly!
The package tigerTree contains a function that will divide a data frame randomly into the desired groups.
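The exact interface of that function isn’t shown here, but the idea can be sketched in base R (verQuiz and verTest are assumed names; only verTrain appears later in these notes):

set.seed(2017)                                # make the random division reproducible
n <- nrow(ver2)
shuffled <- ver2[sample(n), ]                 # put the rows in random order
verTrain <- shuffled[1:round(0.6 * n), ]                    # 60% for training
verQuiz  <- shuffled[(round(0.6 * n) + 1):round(0.8 * n), ] # 20% for the quiz set
verTest  <- shuffled[(round(0.8 * n) + 1):n, ]              # 20% for the test set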
First, a “conservative” tree:
Classification tree:
tree(formula = pitch_type ~ ., data = verTrain)
Variables actually used in tree construction:
[1] "speed" "pfx_x" "pfx_z"
Number of terminal nodes: 6
Residual mean deviance: 0.2488 = 2283 / 9177
Misclassification error rate: 0.0318 = 292 / 9183
Next, an “overgrown” tree:
Classification tree:
tree(formula = pitch_type ~ ., data = verTrain, control = tree.control(nobs = nrow(verTrain),
mincut = 1, minsize = 2, mindev = 0))
Number of terminal nodes: 157
Residual mean deviance: 0 = 0 / 9026
Misclassification error rate: 0 = 0 / 9183
No errors (as you would expect when all the nodes are pure).
Let’s see how the conservative tree does on data that was not used to build it. This is done with the tryTree function from the tigerTree package:
Residual mean deviance: 0.271 = 827.9 / 3055
Misclassification error rate: 0.03398 = 104 / 3061
Confusion matrix:
truth
prediction CH CU FF FT SL
CH 506 0 1 11 10
CU 0 544 0 0 23
FF 3 0 1314 9 2
FT 0 0 31 367 0
SL 0 14 0 0 226
Error rate bigger than on the training set!
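tryTree’s interface isn’t reproduced here, but the kind of check it reports can be sketched with predict() and table() (trSmall and verQuiz are assumed names for the conservative tree and the held-out data):

preds <- predict(trSmall, newdata = verQuiz, type = "class")    # predicted pitch types
conf  <- table(prediction = preds, truth = verQuiz$pitch_type)  # confusion matrix
conf
1 - sum(diag(conf)) / sum(conf)                                 # misclassification rate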
Again use tryTree but with the tr_big model:
Residual mean deviance: 0.0987 = 292.2 / 2961
Misclassification error rate: 0.02777 = 85 / 3061
Confusion matrix:
truth
prediction CH CU FF FT SL
CH 498 0 1 3 9
CU 0 553 0 0 12
FF 2 0 1322 21 0
FT 2 0 22 363 0
SL 7 5 1 0 240
Error rate also bigger than on the training set!
There are very scientific ways to choose the size of a tree, but we’ll just work by hand. Tune the tree by hand with this local Shiny app:
I finally decided to go with:
Variables actually used in tree construction:
[1] "speed" "pfx_x" "pz" "season" "pfx_z" "px" "pitches"
Number of terminal nodes: 41
Residual mean deviance: 0.07468 = 682.7 / 9142
Misclassification error rate: 0.01568 = 144 / 9183
… my choice gave:
… this choice gave:
Our tree does pretty well at imitating Pitch-fx, but it’s not exactly how Pitch-fx makes its classifications:
verlander