| name | age | desire | likesDogs |
|---|---|---|---|
| Dorothy | 12.00 | Kansas | TRUE |
| Scarecrow | 0.02 | brains | TRUE |
| Tinman | 20.00 | heart | FALSE |
| Lion | 18.00 | courage | FALSE |
| Toto | 6.00 | kibbles | NA |
(Sections 7.4 - 7.6)
A data set is tidy when each row corresponds to a case and each column corresponds to a variable.
Case
An individual unit under study. In a data frame in R, the rows correspond to cases.
Variable (in Data Analysis)
A measurement made on the individuals in a study.
Can you use a matrix to store a data set?
No!
Matrices are atomic vectors: they can handle only one type of data at a time.
A data set can have many types of data.
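A quick sketch of what goes wrong, using two columns of the table above. The name `oz_matrix` is only illustrative.

```r
# Try to pack the name and age columns into one matrix.
name <- c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto")
age  <- c(12, 0.02, 20, 18, 6)
oz_matrix <- cbind(name, age)

typeof(oz_matrix)               # "character": the ages were coerced to strings
is.numeric(oz_matrix[, "age"])  # FALSE, so mean() and friends no longer apply
```

Because a matrix holds a single type, R silently converts the numbers to character strings rather than raising an error.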
| name | age | desire | likesDogs |
|---|---|---|---|
| Dorothy | 12.00 | Kansas | TRUE |
| Scarecrow | 0.02 | brains | TRUE |
| Tinman | 20.00 | heart | FALSE |
| Lion | 18.00 | courage | FALSE |
| Toto | 6.00 | kibbles | NA |
If only there were a non-atomic, two-dimensional data structure (so that not all columns have to be of the same data type).
Then:
- name and desire could be character vectors
- age could be double
- likesDogs could be logical

Data Frame
A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.
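As a minimal sketch, the table above can be built with `data.frame()`. The object name `oz` is an assumption made for illustration.

```r
# Each argument becomes one column; columns may differ in type.
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE  # keep text columns as character in any R version
)

str(oz)             # compact summary: 5 observations of 4 variables
sapply(oz, typeof)  # the underlying type of each column
```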
m111survey
It’s in the bcscr package:
In the RStudio IDE, we can get a look at it …
… and in R we can learn more about it:
Print it all out to the Console:
But that’s unwieldy. To see just the first few rows, try:
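The commands for these steps might look like the following sketch. It assumes the bcscr package is installed; the `requireNamespace()` guard is an addition so that the code does nothing, rather than failing, when the package is missing.

```r
if (requireNamespace("bcscr", quietly = TRUE)) {
  data(m111survey, package = "bcscr")   # load the frame into the workspace

  if (interactive()) {
    View(m111survey)                    # spreadsheet-style viewer in RStudio
    print(help("m111survey", package = "bcscr"))  # documentation for the data
  }

  # print(m111survey) would show every row; head() shows only the first six.
  print(head(m111survey))
}
```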
You can use a $ to isolate a variable in a frame:
The result is a vector and may be treated as such:
… is done as with any vector. For example, the mean of the fastest speeds at which our subjects have driven their cars:
Speeds at least 150 miles per hour:
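A self-contained sketch of these steps. The frame `survey` and its values are made up as a stand-in for m111survey; only the column name `fastest` is meant to echo the real data.

```r
# Hypothetical stand-in: fastest speed driven (mph) for five subjects.
survey <- data.frame(fastest = c(119, 110, 85, 160, 150))

speeds <- survey$fastest   # $ pulls out one column
is.vector(speeds)          # TRUE: it is an ordinary numeric vector

mean(speeds)               # 124.8
speeds[speeds >= 150]      # 160 150
```

If a variable contains missing values, `mean()` returns `NA` unless you pass `na.rm = TRUE`.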
Some of the variables in m111survey are factors.
Example: seat
Factors are examples of categorical variables.
Categorical Variable (in Data Analysis)
A variable whose values cannot be expressed meaningfully by numbers.
Usually a factor has only a small number of possible values, called levels.
Start with a vector:
Make a factor from it:
You can also set the order of the levels:
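A minimal sketch of the three steps, with made-up seating values:

```r
# Start with a character vector.
seats <- c("front", "back", "middle", "front", "middle")

# Make a factor from it; levels default to sorted order.
seat_factor <- factor(seats)
levels(seat_factor)    # "back" "front" "middle"

# Set the order of the levels explicitly.
seat_ordered <- factor(seats, levels = c("front", "middle", "back"))
levels(seat_ordered)   # "front" "middle" "back"
table(seat_ordered)    # counts are reported in level order
```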
A categorical variable should not be made into a factor if it has a very large number of possible values.
Suppose your cases are people. The following are OK as factors:
Bad as factors:
To convert a factor to a non-factor:
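One common way, sketched with a made-up factor, is `as.character()`. The last line shows a pitfall worth knowing when the labels look like numbers.

```r
f <- factor(c("10", "20", "10", "30"))

as.character(f)              # "10" "20" "10" "30"
as.numeric(as.character(f))  # 10 20 10 30

# Converting directly gives the internal level codes, not the labels.
as.numeric(f)                # 1 2 1 3
```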
data.frame()
Subsetting with [ ] works as with matrices.
When you ask for just one column, the result is a vector:
Unless you set drop to FALSE:
You can select rows, too:
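A sketch of bracket subsetting on the small table from the start of these notes. The object names are illustrative.

```r
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

ages      <- oz[, "age"]                # one column: a plain vector
age_frame <- oz[, "age", drop = FALSE]  # still a data frame, with one column

# Rows go before the comma, columns after it.
older <- oz[oz$age > 10, c("name", "desire")]
older
```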
subset()
subset() has three important parameters:
- x: the data frame to pick from;
- subset: a boolean expression to pick rows;
- select: the columns you want (default is all of them).

Most people don’t name the first two arguments:
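A sketch on the small table from the start of these notes, first with unnamed arguments and then with every argument named:

```r
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

# First two arguments unnamed; column names need no quotes or oz$ prefix.
older <- subset(oz, age > 10, select = c(name, desire))

# Fully named. Rows where the condition is NA (Toto) are dropped.
dog_lovers <- subset(x = oz, subset = likesDogs, select = name)
dog_lovers
```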
Height in feet, instead of inches:
You can add the new variable to the frame:
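A sketch with made-up heights. The frame `survey` and the new column name `height_ft` are assumptions for illustration.

```r
# Hypothetical stand-in: heights in inches for four subjects.
survey <- data.frame(height = c(60, 66, 72, 69))

height_ft <- survey$height / 12   # arithmetic is applied element by element
height_ft                         # 5.00 5.50 6.00 5.75

survey$height_ft <- height_ft     # assigning to a new name adds a column
names(survey)                     # "height" "height_ft"
```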
When you are interested only in whether or not a variable has a particular value, ifelse() can help:
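For instance, a sketch that records only whether a subject sits in front. The seating values are made up.

```r
seat <- c("front", "middle", "back", "middle")

# ifelse(test, yes, no) picks yes where test is TRUE and no elsewhere.
in_front <- ifelse(seat == "front", "front", "not front")
in_front   # "front" "not front" "not front" "not front"
```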
To recode several values at once, use mapvalues() in the plyr package:
You can save this back to the data frame if you like:
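A sketch of recoding, assuming the plyr package is installed (its recoding function is spelled `mapvalues()`). The frame `survey` and its seat labels are made up, and the `requireNamespace()` guard is an addition so the code is skipped when plyr is missing.

```r
survey <- data.frame(
  seat = c("1_front", "2_middle", "3_back", "2_middle"),
  stringsAsFactors = FALSE
)

if (requireNamespace("plyr", quietly = TRUE)) {
  # Each value in `from` is replaced by the value at the same position in `to`.
  seat_recoded <- plyr::mapvalues(
    survey$seat,
    from = c("1_front", "2_middle", "3_back"),
    to   = c("Front", "Middle", "Back")
  )

  survey$seat <- seat_recoded   # save the recoded variable back to the frame
  print(survey)
}
```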
cut()
- right = FALSE means, e.g., 65 is “Medium”, not “Short”
- right = TRUE means, e.g., 65 is “Short”, not “Medium”

If you don’t want a variable in the data frame, get rid of it like this:
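A sketch covering both points: the effect of `right` in `cut()`, then removing a column. The heights, the upper break at 70, and the column name `height_class` are assumptions; only the boundary at 65 comes from the notes above.

```r
survey <- data.frame(height = c(60, 65, 68, 70, 74))
breaks <- c(-Inf, 65, 70, Inf)
labels <- c("Short", "Medium", "Tall")

# right = FALSE: intervals are closed on the left, so 65 falls in "Medium".
left_closed <- cut(survey$height, breaks = breaks, labels = labels, right = FALSE)
as.character(left_closed)   # "Short" "Medium" "Medium" "Tall" "Tall"

# right = TRUE (the default): closed on the right, so 65 falls in "Short".
right_closed <- cut(survey$height, breaks = breaks, labels = labels, right = TRUE)
as.character(right_closed)  # "Short" "Short" "Medium" "Medium" "Tall"

# Add the new variable, then remove it again by assigning NULL.
survey$height_class <- left_closed
survey$height_class <- NULL
names(survey)               # "height"
```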