name | age | desire | likesDogs |
---|---|---|---|
Dorothy | 12.00 | Kansas | TRUE |
Scarecrow | 0.02 | brains | TRUE |
Tinman | 20.00 | heart | FALSE |
Lion | 18.00 | courage | FALSE |
Toto | 6.00 | kibbles | NA |
(Sections 7.4 - 7.6)
A data set is tidy when:
- each row corresponds to a case;
- each column corresponds to a variable.
Case
An individual unit under study. In a data frame in R, the rows correspond to cases.
Variable (in Data Analysis)
A measurement made on the individuals in a study.
Can you use a matrix to store a data set?
No!
Matrices are atomic vectors: they can handle only one type of data at a time.
A data set can have many types of data.
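A quick way to see the problem is to bind a character column and a numeric column into a matrix. The sketch below uses three rows of the table above; the name ozMatrix is made up for illustration.

```r
# Binding a character column and a numeric column into a matrix
# forces every entry to one type (here: character).
ozMatrix <- cbind(
  name = c("Dorothy", "Scarecrow", "Tinman"),
  age  = c(12, 0.02, 20)
)

typeof(ozMatrix)    # "character"
ozMatrix[, "age"]   # the ages are now strings, not numbers
```

The ages can no longer be averaged or compared as numbers without converting them back.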
If only there were a non-atomic, two-dimensional data structure (so that not all columns have to be of the same data type).
Then:
- name and desire could be character vectors
- age could be double
- likesDogs could be logical
Data Frame
A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.
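As a minimal sketch, the table from the start of these notes can be stored as a data frame with data.frame(). The object name oz is an assumption made for this example.

```r
# Each column is an atomic vector; the columns differ in type.
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE  # keep text columns as character vectors
)

# Report the type of each column.
sapply(oz, typeof)
```

Each row is a case (a character from the story) and each column is a variable, so this frame is tidy.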
m111survey
It’s in the bcscr package:
In the RStudio IDE, we can get a look at it …
… and in R we can learn more about it:
Print it all out to the Console:
But that’s unwieldy. To see just the first few rows, try:
height ideal_ht sleep fastest weight_feel love_first extra_life seat
1 76 78 9.5 119 1_underweight no yes 1_front
2 74 76 7.0 110 2_about_right no yes 2_middle
3 64 NA 9.0 85 2_about_right no no 2_middle
4 62 65 7.0 100 1_underweight no no 1_front
GPA enough_Sleep sex diff.ideal.act.
1 3.56 no male 2
2 2.50 no male 2
3 3.80 no female NA
4 3.50 no female 3
'data.frame': 71 obs. of 12 variables:
$ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
$ ideal_ht : num 78 76 NA 65 72 NA 72 76 61 67 ...
$ sleep : num 9.5 7 9 7 8 10 4 6 7 7 ...
$ fastest : int 119 110 85 100 95 100 85 160 90 90 ...
$ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
$ love_first : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ extra_life : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
$ seat : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
$ GPA : num 3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
$ enough_Sleep : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
$ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
$ diff.ideal.act.: num 2 2 NA 3 0 NA 2 -3 2 0 ...
[1] "height" "ideal_ht" "sleep" "fastest"
[5] "weight_feel" "love_first" "extra_life" "seat"
[9] "GPA" "enough_Sleep" "sex" "diff.ideal.act."
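The three listings above have the shape of output from head(), str() and names(). The sketch below runs those functions on a small stand-in frame, called survey here (a hypothetical name), built from the first six rows and six of the columns shown above, so that it runs without the bcscr package.

```r
# Stand-in for m111survey: the first six cases and six of the twelve
# variables, copied from the listings above.
survey <- data.frame(
  height   = c(76, 74, 64, 62, 72, 70.8),
  ideal_ht = c(78, 76, NA, 65, 72, NA),
  sleep    = c(9.5, 7, 9, 7, 8, 10),
  fastest  = c(119L, 110L, 85L, 100L, 95L, 100L),
  seat     = factor(c("1_front", "2_middle", "2_middle",
                      "1_front", "3_back", "1_front")),
  sex      = factor(c("male", "male", "female",
                      "female", "male", "male"))
)

firstFour <- head(survey, n = 4)  # just the first four rows
print(firstFour)
str(survey)                       # type and first values of each variable
names(survey)                     # the variable names only
```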
You can use a $ to isolate a variable in a frame:
The result is a vector and may be treated as such:
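A small sketch of the $ operator, using a stand-in frame (survey, a hypothetical name) that holds the first four heights and fastest speeds from the listing above:

```r
# Stand-in frame with two of the survey variables.
survey <- data.frame(
  height  = c(76, 74, 64, 62),
  fastest = c(119L, 110L, 85L, 100L)
)

speeds <- survey$fastest  # pull out one variable by name
is.vector(speeds)         # TRUE: an ordinary atomic vector
length(speeds)            # one element per case
speeds[2]                 # index it like any other vector
```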
… is done as with any vector. For example, the mean fastest speed our subjects drove their cars:
Speeds at least 150 miles per hour:
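Both computations can be sketched on the first ten fastest values shown in the str() listing above. The frame name survey is an assumption, and the mean below is for these ten values only, not for the full data set.

```r
# First ten values of fastest, copied from the str() listing.
survey <- data.frame(
  fastest = c(119, 110, 85, 100, 95, 100, 85, 160, 90, 90)
)

# Mean fastest speed; na.rm = TRUE skips any missing values.
meanFastest <- mean(survey$fastest, na.rm = TRUE)
meanFastest

# Speeds at least 150 miles per hour, picked with a logical vector.
veryFast <- survey$fastest[survey$fastest >= 150]
veryFast
```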
Some of the variables in m111survey are factors.
Example: seat
Factors are examples of categorical variables.
Categorical Variable (in Data Analysis)
A variable whose values cannot be expressed meaningfully by numbers.
Usually a factor has only a small number of possible values, called levels.
Start with a vector:
Make a factor from it:
You can also set the order of the levels yourself.
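The three steps above can be sketched as follows. The seating values are made up for illustration.

```r
# Start with a character vector.
seats <- c("front", "back", "middle", "front", "back")

# Make a factor from it; by default the levels are sorted alphabetically.
seatFactor <- factor(seats)
levels(seatFactor)

# Set the order of the levels explicitly.
seatOrdered <- factor(seats, levels = c("front", "middle", "back"))
levels(seatOrdered)

table(seatOrdered)  # counts are reported in level order
```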
A categorical variable should not be made into a factor if it has a very large number of possible values.
Suppose your cases are people. The following are OK as factors:
Bad as factors:
To convert a factor to a non-factor:
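One common approach, shown here as a sketch with made-up values, is as.character(). When the levels look like numbers, convert to character first and then to numeric, because as.numeric() applied directly to a factor returns the internal level codes.

```r
seat <- factor(c("front", "back", "front"))
seatChars <- as.character(seat)  # plain character vector
is.factor(seatChars)             # FALSE

# Levels that look like numbers need two steps.
yearFactor <- factor(c("2019", "2021", "2019"))
years <- as.numeric(as.character(yearFactor))  # the actual values
codes <- as.numeric(yearFactor)                # level codes, not the years
```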
data.frame()
… works as with matrices.
When you ask for just one column, the result is a vector:
Unless you set drop to FALSE:
You can select rows, too:
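A sketch of these selections with the [row, column] syntax, using a frame built from the table at the start of these notes (the name oz is an assumption):

```r
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

twoCols  <- oz[, c("name", "age")]      # several columns: a data frame
ageVec   <- oz[, "age"]                 # one column: a vector
ageFrame <- oz[, "age", drop = FALSE]   # one column, kept as a data frame

older       <- oz[oz$age > 10, ]        # rows chosen by a condition
firstAndLast <- oz[c(1, 5), ]           # rows chosen by position
```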
subset()
subset() has three important parameters:
- x: the data frame to pick from;
- subset: a boolean expression to pick rows;
- select: the columns you want (default is all of them)
Most people don’t name the first two arguments:
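A minimal sketch of subset() with the first two arguments left unnamed, using a frame built from the table at the start of these notes (the name oz is an assumption):

```r
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Rows where age exceeds 10; keep only the name and desire columns.
olderOnes <- subset(oz, age > 10, select = c(name, desire))
olderOnes

# With no select argument, all columns are kept.
# Rows where the condition is NA (Toto) are dropped.
dogFans <- subset(oz, likesDogs)
dogFans
```

Inside subset() the column names can be written bare, without the oz$ prefix.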
Height in feet, instead of inches:
You can add the new variable to the frame:
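Both steps can be sketched on a stand-in frame holding the first four heights from the listing above. The names survey and height_ft are hypothetical.

```r
survey <- data.frame(height = c(76, 74, 64, 62))

# Height in feet: arithmetic on a column is vectorized.
heightFeet <- survey$height / 12
heightFeet

# Assigning to a new name with $ adds the variable to the frame.
survey$height_ft <- heightFeet
names(survey)
```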
When you are interested only in whether or not a variable has a particular value, ifelse() can help:
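As a sketch, assuming we only care whether a student sits in the front, the seat values can be collapsed to two labels. The labels "front" and "other" are made up for illustration.

```r
# First five seat values from the str() listing above.
seat <- factor(c("1_front", "2_middle", "2_middle", "1_front", "3_back"))

# ifelse(test, yes, no) works element by element.
inFront <- ifelse(seat == "1_front", "front", "other")
inFront
```

The result is a character vector; wrap it in factor() if a factor is wanted.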
Use mapvalues() in the plyr package:
seat3 <- plyr::mapvalues(
m111survey$seat,
from = c("1_front", "2_middle", "3_back"),
to = c("Front", "Middle", "Back")
)
str(seat3)
Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...
You can save this back to the data frame if you like:
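A sketch of saving the recoded factor back, on a stand-in frame (survey, a hypothetical name). To stay runnable without plyr, the recoding here uses base R's factor() with levels and labels in place of mapvalues().

```r
survey <- data.frame(
  seat = factor(c("1_front", "2_middle", "2_middle", "1_front", "3_back"))
)

# Base-R substitute for the mapvalues() call: relabel the levels.
seat3 <- factor(
  survey$seat,
  levels = c("1_front", "2_middle", "3_back"),
  labels = c("Front", "Middle", "Back")
)

# Overwrite the old variable with the recoded one.
survey$seat <- seat3
str(survey$seat)
```

Assigning to a new name instead, such as survey$seat3, would keep the original variable alongside the recoded one.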
cut()
heightClass <- cut(
m111survey$height,
breaks = c(-Inf, 65, 70, Inf),
labels = c("Short", "Medium","Tall"),
right = FALSE
)
str(heightClass)
Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 3 3 1 2 ...
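The effect of the right argument can be checked on values that sit exactly on a break. This sketch uses the same breaks and labels as above on a few made-up heights.

```r
heights <- c(64, 65, 66, 70)
breaks  <- c(-Inf, 65, 70, Inf)
labs    <- c("Short", "Medium", "Tall")

# right = FALSE: intervals are closed on the left, e.g. [65, 70).
leftClosed <- cut(heights, breaks = breaks, labels = labs, right = FALSE)

# right = TRUE: intervals are closed on the right, e.g. (65, 70].
rightClosed <- cut(heights, breaks = breaks, labels = labs, right = TRUE)

data.frame(heights, leftClosed, rightClosed)
```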
- right = FALSE means, e.g., 65 is “Medium”, not “Short”
- right = TRUE means, e.g., 65 is “Short”, not “Medium”
If you don’t want a variable in the data frame, get rid of it like this:
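One common way, sketched on a stand-in frame with hypothetical names (survey, height_ft), is to assign NULL to the variable:

```r
survey <- data.frame(
  height    = c(76, 74, 64, 62),
  height_ft = c(76, 74, 64, 62) / 12
)

# Assigning NULL removes the column from the frame.
survey$height_ft <- NULL
names(survey)
```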