| name | age | desire | likesDogs |
|---|---|---|---|
| Dorothy | 12.00 | Kansas | TRUE |
| Scarecrow | 0.02 | brains | TRUE |
| Tinman | 20.00 | heart | FALSE |
| Lion | 18.00 | courage | FALSE |
| Toto | 6.00 | kibbles | NA |
(Sections 7.4 - 7.6)
A data set is tidy when each row corresponds to a case and each column corresponds to a variable.
Case
An individual unit under study. In a data frame in R, the rows correspond to cases.
Variable (in Data Analysis)
A measurement made on the individuals in a study.
Can you use a matrix to store a data set?
No!
Matrices are atomic vectors: they can handle only one type of data at a time.
A data set can have many types of data.
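A minimal sketch of the problem, using two rows of the table above: binding columns of different types into a matrix coerces every entry to character. The object name `oz` is hypothetical.

```r
# Combine a character, a numeric, and a logical column into a matrix.
oz <- cbind(
  name = c("Dorothy", "Tinman"),
  age = c(12, 20),
  likesDogs = c(TRUE, FALSE)
)

# A matrix holds a single type, so every entry was coerced to character.
typeof(oz)
oz[, "age"]
```

The ages come back as text, so arithmetic on them would fail until they are converted again.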
| name | age | desire | likesDogs |
|---|---|---|---|
| Dorothy | 12.00 | Kansas | TRUE |
| Scarecrow | 0.02 | brains | TRUE |
| Tinman | 20.00 | heart | FALSE |
| Lion | 18.00 | courage | FALSE |
| Toto | 6.00 | kibbles | NA |
… there were a non-atomic two-dimensional data structure (so that not all columns have to be of the same data type).
Then:
- name and desire could be character vectors
- age could be double
- likesDogs could be logical

Data Frame
A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.
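As a sketch, the example table can be built with `data.frame()`, one vector per column. The object name `oz` is hypothetical, and `stringsAsFactors = FALSE` is included only so that the text columns stay character in older versions of R.

```r
# Build the example table as a data frame, one vector per column.
oz <- data.frame(
  name = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age = c(12, 0.02, 20, 18, 6),
  desire = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE  # keep text columns as character in any R version
)

# Each column keeps its own type.
sapply(oz, typeof)
```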
m111survey
It’s in the bcscr package:
In the RStudio IDE, we can get a look at it …
… and in R we can learn more about it:
Print it all out to the Console:
But that’s unwieldy. To see just the first few rows, try:
  height ideal_ht sleep fastest   weight_feel love_first extra_life     seat
1     76       78   9.5     119 1_underweight         no        yes  1_front
2     74       76   7.0     110 2_about_right         no        yes 2_middle
3     64       NA   9.0      85 2_about_right         no         no 2_middle
4     62       65   7.0     100 1_underweight         no         no  1_front
   GPA enough_Sleep    sex diff.ideal.act.
1 3.56           no   male               2
2 2.50           no   male               2
3 3.80           no female              NA
4 3.50           no female               3
'data.frame': 71 obs. of 12 variables:
$ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
$ ideal_ht : num 78 76 NA 65 72 NA 72 76 61 67 ...
$ sleep : num 9.5 7 9 7 8 10 4 6 7 7 ...
$ fastest : int 119 110 85 100 95 100 85 160 90 90 ...
$ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
$ love_first : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ extra_life : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
$ seat : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
$ GPA : num 3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
$ enough_Sleep : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
$ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
$ diff.ideal.act.: num 2 2 NA 3 0 NA 2 -3 2 0 ...
[1] "height" "ideal_ht" "sleep" "fastest"
[5] "weight_feel" "love_first" "extra_life" "seat"
[9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
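The three outputs above have the shape produced by base R's `head()`, `str()`, and `names()`. A minimal sketch on a small stand-in frame (the name `survey4` and the choice of four columns are assumptions; the values are copied from the first four rows shown above):

```r
# Stand-in for m111survey: four rows, four of the twelve variables.
survey4 <- data.frame(
  height = c(76, 74, 64, 62),
  fastest = c(119L, 110L, 85L, 100L),
  seat = factor(c("1_front", "2_middle", "2_middle", "1_front")),
  sex = factor(c("male", "male", "female", "female"))
)

head(survey4, n = 2)  # first two rows only
str(survey4)          # one line per variable: type and first values
names(survey4)        # the variable names
```

With the bcscr package installed, the same calls should work on `m111survey` itself.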
You can use a $ to isolate a variable in a frame:
The result is a vector and may be treated as such:
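A minimal sketch, again on a hypothetical stand-in frame `survey4` built from the first four rows shown earlier:

```r
survey4 <- data.frame(
  height = c(76, 74, 64, 62),
  fastest = c(119L, 110L, 85L, 100L)
)

speeds <- survey4$fastest  # isolate one variable
is.vector(speeds)          # TRUE
length(speeds)             # 4
speeds[2]                  # second element, 110
```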
… is done as with any vector. For example, the mean fastest speed our subjects drove their cars:
Speeds at least 150 miles per hour:
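Both computations can be sketched on the first ten values of `fastest`, copied from the `str()` output above. The frame name `survey10` is hypothetical, and the results apply to these ten values only, not to the full data set.

```r
# First ten values of `fastest`, copied from the str() output above.
fastest <- c(119L, 110L, 85L, 100L, 95L, 100L, 85L, 160L, 90L, 90L)
survey10 <- data.frame(fastest = fastest)

# Mean fastest speed; na.rm = TRUE skips missing values if there are any.
mean(survey10$fastest, na.rm = TRUE)

# Speeds of at least 150 miles per hour, picked with a logical condition.
survey10$fastest[survey10$fastest >= 150]
```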
Some of the variables in m111survey are factors.
Example: seat
Factors are examples of categorical variables.
Categorical Variable (in Data Analysis)
A variable whose values cannot be expressed meaningfully by numbers.
Usually a factor has only a small number of possible values, called levels.
Start with a vector:
Make a factor from it:
You could set the order of the levels:
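The three steps can be sketched as follows. The vector and its values are made up for illustration.

```r
# Start with a character vector.
seatVec <- c("front", "middle", "middle", "front", "back")

# Make a factor; by default the levels are sorted alphabetically.
seatFac <- factor(seatVec)
levels(seatFac)   # "back" "front" "middle"

# Set the order of the levels explicitly.
seatFac2 <- factor(seatVec, levels = c("front", "middle", "back"))
levels(seatFac2)  # "front" "middle" "back"
```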
A categorical variable should not be made into a factor if it has a very large number of possible values.
Suppose your cases are people. The following are OK as factors:
Bad as factors:
To convert a factor to a non-factor:
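One common base-R approach is `as.character()`. A minimal sketch with made-up values:

```r
seatFac <- factor(c("front", "middle", "back"))
seatChr <- as.character(seatFac)  # back to a plain character vector
is.factor(seatChr)                # FALSE

# For factors whose levels look like numbers, convert via character first;
# as.numeric() applied directly would return the internal integer codes.
numFac <- factor(c("10", "20", "10"))
as.numeric(as.character(numFac))  # 10 20 10
```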
data.frame()
… works as with matrices.
When you ask for just one column, the result is a vector:
Unless you set drop to FALSE:
You can select rows, too:
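A minimal sketch of these selections with square brackets, on a hypothetical stand-in frame `survey4` built from the first four rows shown earlier:

```r
survey4 <- data.frame(
  height = c(76, 74, 64, 62),
  fastest = c(119L, 110L, 85L, 100L),
  sex = c("male", "male", "female", "female")
)

survey4[, "height"]                # one column: a plain vector
survey4[, "height", drop = FALSE]  # one column, kept as a data frame
survey4[2:3, c("height", "sex")]   # rows 2 and 3, two columns
survey4[survey4$height > 70, ]     # rows picked by a logical condition
```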
subset()
subset() has three important parameters:
- x: the data frame to pick from;
- subset: a boolean expression to pick rows;
- select: the columns you want (default is all of them)

Most people don’t name the first two arguments:
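A minimal sketch, with the first two arguments unnamed. The frame `survey4` and the thresholds are hypothetical.

```r
survey4 <- data.frame(
  height = c(76, 74, 64, 62),
  fastest = c(119L, 110L, 85L, 100L),
  sex = c("male", "male", "female", "female")
)

# First two arguments unnamed: the frame, then the row condition.
tallFast <- subset(
  survey4,
  height > 70 & fastest >= 110,
  select = c(height, fastest)
)
tallFast
```

Inside `subset()`, column names can be written bare, without the `survey4$` prefix.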
Height in feet, instead of inches:
You can add the new variable to the frame:
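Both steps can be sketched as follows. The frame `survey4` and the column name `height_ft` are hypothetical.

```r
survey4 <- data.frame(height = c(76, 74, 64, 62))

heightFeet <- survey4$height / 12  # 12 inches per foot
survey4$height_ft <- heightFeet    # assigning to a new name adds a column
names(survey4)
```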
When you are interested only in whether or not a variable has a particular value, ifelse() can help:
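A minimal sketch, assuming we only care whether a student sits in the front. The vector `seat` and the labels "front" and "elsewhere" are made up for illustration.

```r
seat <- c("1_front", "2_middle", "2_middle", "1_front", "3_back")

# Arguments: a TRUE/FALSE condition, the value for TRUE, the value for FALSE.
inFront <- ifelse(seat == "1_front", "front", "elsewhere")
inFront
```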
Use mapvalues() in the plyr package:
seat3 <- plyr::mapvalues(
  m111survey$seat,
  from = c("1_front", "2_middle", "3_back"),
  to = c("Front", "Middle", "Back")
)
str(seat3)
 Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...
You can save this back to the data frame if you like:
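A minimal sketch of saving a recoded factor back, on a hypothetical stand-in frame `survey4`. So that it runs without plyr, the recoding here is done with base R's `levels<-`, which is a swapped-in technique rather than the `mapvalues()` call above.

```r
survey4 <- data.frame(
  seat = factor(
    c("1_front", "2_middle", "2_middle", "1_front"),
    levels = c("1_front", "2_middle", "3_back")
  )
)

# Base-R stand-in for the recoded factor: same codes, new level labels.
seat3 <- survey4$seat
levels(seat3) <- c("Front", "Middle", "Back")

survey4$seat <- seat3  # overwrite the old column with the recoded one
levels(survey4$seat)
```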
cut()
heightClass <- cut(
  m111survey$height,
  breaks = c(-Inf, 65, 70, Inf),
  labels = c("Short", "Medium", "Tall"),
  right = FALSE
)
str(heightClass)
 Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 3 3 1 2 ...
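To see what `right` does at the boundaries, here is a minimal sketch with three made-up heights and the same breaks and labels as above:

```r
heights <- c(64, 65, 70)
brks <- c(-Inf, 65, 70, Inf)
labs <- c("Short", "Medium", "Tall")

# right = FALSE: intervals are closed on the left, e.g. [65, 70).
leftClosed <- cut(heights, breaks = brks, labels = labs, right = FALSE)
# right = TRUE: intervals are closed on the right, e.g. (65, 70].
rightClosed <- cut(heights, breaks = brks, labels = labs, right = TRUE)

as.character(leftClosed)   # "Short" "Medium" "Tall"
as.character(rightClosed)  # "Short" "Short" "Medium"
```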
- right = FALSE means, e.g., 65 is “Medium”, not “Short”
- right = TRUE means, e.g., 65 is “Short”, not “Medium”

If you don’t want a variable in the data frame, get rid of it like this:
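One common base-R way is to assign `NULL` to the column. A minimal sketch on a hypothetical stand-in frame `survey4`:

```r
survey4 <- data.frame(
  height = c(76, 74, 64, 62),
  fastest = c(119L, 110L, 85L, 100L)
)

survey4$fastest <- NULL  # drop the column
names(survey4)           # "height"
```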