Data Frames

(Sections 7.4 - 7.6)

Tidy Data

A Dataset

Companions on the Yellow Brick Road
name	age	desire	likesDogs
Dorothy	12.00	Kansas	TRUE
Scarecrow	0.02	brains	TRUE
Tinman	20.00	heart	FALSE
Lion	18.00	courage	FALSE
Toto	6.00	kibbles	NA

Tidy Data

A data set is tidy when:

Each row corresponds to a unique entity of the same sort (a case).
In each column, the values record the same type of information about each case.

Definitions

Case

An individual unit under study. In a data frame in R, the rows correspond to cases.

Variable (in Data Analysis)

A measurement made on the individuals in a study.

Storing Data Sets in R

Can you use a matrix to store a data set?

No!

Matrices are atomic vectors: they can handle only one type of data at a time.

A data set can have many types of data.
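A quick way to see the problem is to try it. The sketch below binds a character column and a numeric column into a matrix (the object name `m` is only for illustration); R silently coerces every value to character.

```r
# Try to store two columns of different types in one matrix
m <- cbind(name = c("Dorothy", "Lion"), age = c(12, 18))

typeof(m)    # "character": the ages were coerced to strings
m[, "age"]   # the ages are now "12" and "18", not numbers
```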

Several Types of Data …

Companions on the Yellow Brick Road
name	age	desire	likesDogs
Dorothy	12.00	Kansas	TRUE
Scarecrow	0.02	brains	TRUE
Tinman	20.00	heart	FALSE
Lion	18.00	courage	FALSE
Toto	6.00	kibbles	NA

It Would be Great If …

… there were a non-atomic, two-dimensional data structure (so that not all columns have to be of the same data type).

Then:

name and desire could be character vectors
age could be double
likesDogs could be logical

Data Frames

Definition of “Data Frame”

Data Frame

A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.
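As a minimal sketch of the definition, the Oz companions table can be stored with one atomic vector per column, using `data.frame()` (covered in more detail later in these notes). The object name `companions` is an assumption made for this example.

```r
# The Oz companions, one atomic vector per column
companions <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE  # keep character columns as character in any R version
)

sapply(companions, class)  # each column reports its own type
dim(companions)            # 5 rows (cases), 4 columns (variables)
```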

Example Data Frame: `m111survey`

It’s in the bcscr package:

library(bcscr)

In the RStudio IDE, we can get a look at it …

View(m111survey)

… and in R we can learn more about it:

?m111survey

Other Ways to See a Data Frame

Print it all out to the Console:

m111survey

But that’s unwieldy. To see just the first few rows, try:

head(m111survey, n = 4)

  height ideal_ht sleep fastest   weight_feel love_first extra_life     seat
1     76       78   9.5     119 1_underweight         no        yes  1_front
2     74       76   7.0     110 2_about_right         no        yes 2_middle
3     64       NA   9.0      85 2_about_right         no         no 2_middle
4     62       65   7.0     100 1_underweight         no         no  1_front
   GPA enough_Sleep    sex diff.ideal.act.
1 3.56           no   male               2
2 2.50           no   male               2
3 3.80           no female              NA
4 3.50           no female               3

The Structure of a Data Frame

str(m111survey)

'data.frame':   71 obs. of  12 variables:
 $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...
 $ ideal_ht       : num  78 76 NA 65 72 NA 72 76 61 67 ...
 $ sleep          : num  9.5 7 9 7 8 10 4 6 7 7 ...
 $ fastest        : int  119 110 85 100 95 100 85 160 90 90 ...
 $ weight_feel    : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
 $ love_first     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ extra_life     : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
 $ seat           : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
 $ GPA            : num  3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
 $ enough_Sleep   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
 $ sex            : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
 $ diff.ideal.act.: num  2 2 NA 3 0 NA 2 -3 2 0 ...

Get the Variables

names(m111survey)

 [1] "height"          "ideal_ht"        "sleep"           "fastest"        
 [5] "weight_feel"     "love_first"      "extra_life"      "seat"           
 [9] "GPA"             "enough_Sleep"    "sex"             "diff.ideal.act."

You can use a $ to isolate a variable in a frame:

m111survey$fastest

The result is a vector and may be treated as such:

m111survey$fastest[1:10]

 [1] 119 110  85 100  95 100  85 160  90  90

Computing with a Variable …

… is done as with any vector. For example, the mean of the fastest speeds at which our subjects have driven their cars:

mean(m111survey$fastest, na.rm = TRUE)

[1] 105.9014

Speeds at least 150 miles per hour:

m111survey$fastest[m111survey$fastest >= 150]

[1] 160 190
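Because a comparison on a vector yields a logical vector, counts and proportions come almost for free. The sketch below uses only the first ten values of `fastest`, copied from the output shown earlier; the name `speeds` is an assumption for this example.

```r
# First ten values of fastest, copied from the output shown earlier
speeds <- c(119, 110, 85, 100, 95, 100, 85, 160, 90, 90)

mean(speeds)          # average of these ten speeds
sum(speeds >= 100)    # how many are at least 100 mph (TRUE counts as 1)
mean(speeds >= 100)   # proportion that are at least 100 mph
```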

Factors

Factor Variables

Some of the variables in m111survey are factors.

Example: seat

str(m111survey$seat)

 Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...

Factors are examples of categorical variables.

Definition

Categorical Variable (in Data Analysis)

A variable whose values cannot be expressed meaningfully by numbers.

Usually a factor has only a small number of possible values, called levels.

levels(m111survey$seat)

[1] "1_front"  "2_middle" "3_back"

Making a Factor

Start with a vector:

ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

Make a factor from it:

factorFavs <- factor(ozFavs)
factorFavs

[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Dorothy Glinda Scarecrow Toto

Determining the Levels

You can set the order of the levels yourself:

factorFavs2 <- factor(ozFavs,
                      levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))
factorFavs2

[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Toto Scarecrow Glinda Dorothy
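The level order matters because a factor stores integer codes that index into its levels. A small sketch, reusing `ozFavs` (the names `f1` and `f2` are assumptions for this example), shows the same data giving different codes under the two orderings:

```r
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

# Default: levels sorted alphabetically
f1 <- factor(ozFavs)
# Custom level order
f2 <- factor(ozFavs, levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))

as.integer(f1)  # integer codes follow the alphabetical levels
as.integer(f2)  # same data, different codes
table(f2)       # counts are reported in level order
```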

Caution

A categorical variable should not be made into a factor if it has a very large number of possible values.

Suppose your cases are people. The following are OK as factors:

sex (male, female, other)
favorite sport

Bad as factors:

street address
favorite quotation

Make it Not a Factor

To convert a factor to a character vector:

as.character(factorFavs)

[1] "Glinda"    "Toto"      "Toto"      "Dorothy"   "Toto"      "Glinda"   
[7] "Scarecrow" "Dorothy"
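One related pitfall is worth a sketch. When a factor's labels look like numbers, `as.numeric()` returns the internal level codes rather than the labels, so it is usually safer to go through `as.character()` first. The values below are hypothetical.

```r
# A factor whose labels look like numbers (hypothetical values)
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                # 1 2 1 3: the internal level codes
as.numeric(as.character(f))  # 10 20 10 30: the values you probably wanted
```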

Creating Data Frames

Two Ways to Make a Data Frame

Create it directly from vectors. (We will all learn this.)
Bind two or more existing data frames together. (Optional, see “More in Depth” section.)

Direct Creation: `data.frame()`

n <- c("Dorothy", "Lion", "Scarecrow")
h <- c(58, 75, 69)
a <- c(12, 18, 0.04)
ozFolk <- data.frame(name = n, height = h, age = a)
ozFolk

       name height   age
1   Dorothy     58 12.00
2      Lion     75 18.00
3 Scarecrow     69  0.04
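For the second approach (binding existing frames), here is a minimal sketch using base R's `rbind()`, which stacks the rows of frames that share the same column names. The frame `newcomer` and the height in it are made up for illustration.

```r
ozFolk <- data.frame(
  name   = c("Dorothy", "Lion", "Scarecrow"),
  height = c(58, 75, 69),
  age    = c(12, 18, 0.04)
)

# A second frame with the same columns; the height here is made up
newcomer <- data.frame(name = "Toto", height = 12, age = 6)

# Stack the rows of the two frames
ozAll <- rbind(ozFolk, newcomer)
nrow(ozAll)  # 4
```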

Subsetting with Data Frames

Subsetting with Data Frames …

… works as with matrices.

# all the rows, only two variables:
df <- m111survey[, c("height", "ideal_ht")]
head(df)

  height ideal_ht
1   76.0       78
2   74.0       76
3   64.0       NA
4   62.0       65
5   72.0       72
6   70.8       NA

“Dropping”

When you ask for just one column, the result is a vector:

df <- m111survey[, "height"]
str(df)

 num [1:71] 76 74 64 62 72 70.8 70 79 59 67 ...

Unless you set drop to FALSE:

df <- m111survey[, "height", drop = FALSE]
str(df)

'data.frame':   71 obs. of  1 variable:
 $ height: num  76 74 64 62 72 70.8 70 79 59 67 ...
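The same distinction shows up with the other extraction operators. The sketch below uses a small stand-in frame rather than `m111survey`, so that it runs without the bcscr package:

```r
ozFolk <- data.frame(
  name   = c("Dorothy", "Lion", "Scarecrow"),
  height = c(58, 75, 69)
)

class(ozFolk[, "height"])                # "numeric": dropped to a vector
class(ozFolk[, "height", drop = FALSE])  # "data.frame"
class(ozFolk["height"])                  # "data.frame": single bracket, no comma
class(ozFolk[["height"]])                # "numeric": same as ozFolk$height
```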

Another Example

You can select rows, too:

# rows 10-15, only two columns:
m111survey[10:15, c("height", "ideal_ht")]

   height ideal_ht
10     67       67
11     65       69
12     62       62
13     59       62
14     78       75
15     69       72

Selecting Rows at Random

n <- nrow(m111survey)
# six random rows:
df <- m111survey[sample(1:n, size = 6, replace = FALSE), ]
df[c("sex", "seat")]  # show just two columns

      sex     seat
13 female  1_front
54   male 2_middle
56   male   3_back
28 female  1_front
53 female   3_back
46 female 2_middle
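Each call to `sample()` normally picks different rows. If you need the same "random" rows every time (for example, in a report), one option is to call `set.seed()` first. The sketch below uses a hypothetical stand-in frame and an arbitrary seed value:

```r
# A small stand-in frame (hypothetical data) so the sketch is self-contained
df <- data.frame(id = 1:20, value = (1:20)^2)
n <- nrow(df)

set.seed(1234)  # any fixed integer works
rows1 <- sample(1:n, size = 6, replace = FALSE)

set.seed(1234)  # same seed again ...
rows2 <- sample(1:n, size = 6, replace = FALSE)

identical(rows1, rows2)  # ... so the same rows are selected
df[rows1, ]
```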

Subsetting With Boolean Expressions

# select only the rows where fastest is at least 150:
df <- m111survey[m111survey$fastest >= 150, ]
df[, c("sex", "fastest")]  # show just two of the variables

    sex fastest
8  male     160
32 male     190

Subsetting with `subset()`

subset() has three important parameters:

x: the data frame to pick from;
subset: a boolean expression to pick rows;
select: the columns you want (default is all of them)

Example

subset(
  x = m111survey, 
  subset = fastest >= 150,
  select = c("sex", "fastest")
  )

    sex fastest
8  male     160
32 male     190

Abbreviation

Most people don’t name the first two arguments:

subset(m111survey, fastest >= 150, select = c("sex", "fastest"))

    sex fastest
8  male     160
32 male     190
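`subset()` and bracket subsetting differ when the condition involves missing values: `subset()` treats `NA` as `FALSE`, while `[` returns a row of `NA`s for each missing condition. A sketch with hypothetical data (the frame `d` is not part of `m111survey`):

```r
# Hypothetical data with one missing speed
d <- data.frame(id = 1:4, fastest = c(119, NA, 160, 190))

d[d$fastest >= 150, ]         # 3 rows: the NA produces a row of NAs
subset(d, fastest >= 150)     # 2 rows: subset() treats NA as FALSE
d[which(d$fastest >= 150), ]  # 2 rows: which() also drops the NA
```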

It Can Become Quite Complex!

df <- subset(
  m111survey,
  seat == "3_back" & height < 72 & sex == "female",
  select = c("sex", "height", "seat")
  )
df

      sex height   seat
9  female     59 3_back
20 female     65 3_back
30 female     69 3_back
53 female     69 3_back
70 female     65 3_back

New Variables from Old

Transforming a Variable

Height in feet, instead of inches:

heightInFeet <- m111survey$height / 12 # 12 inches in a foot

You can add the new variable to the frame:

m111survey$height_ft <- heightInFeet

Recoding

When you are interested only in whether or not a variable has a particular value, ifelse() can help:

seat2 <- ifelse(m111survey$seat == "3_back", "Back", "Other")
m111survey$seat2 <- seat2
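Two details of `ifelse()` are easy to miss: the result here is a character vector rather than a factor, and a missing value in the test stays missing in the result. A sketch with hypothetical seat values:

```r
# Hypothetical seat values, including one missing answer
seat <- factor(c("1_front", "3_back", NA, "2_middle"))

seat2 <- ifelse(seat == "3_back", "Back", "Other")
seat2          # "Other" "Back" NA "Other"
class(seat2)   # "character", not "factor"

# Convert if you want a factor
seat2 <- factor(seat2)
levels(seat2)  # "Back" "Other"
```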

Recoding (3 or More Values)

Use mapvalues() in the plyr package:

seat3 <- plyr::mapvalues(
  m111survey$seat,
  from = c("1_front", "2_middle", "3_back"),
  to = c("Front", "Middle", "Back")
)
str(seat3)

 Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...

You can save this back to the data frame if you like:

m111survey$seat3 <- seat3
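If you would rather not depend on plyr, a factor can also be relabeled in base R by assigning to `levels()`. This is an alternative sketch, not the approach used above, and it uses a small hypothetical factor; the new labels must be given in the same order as the existing levels.

```r
seat <- factor(c("1_front", "2_middle", "2_middle", "3_back"))
levels(seat)   # "1_front" "2_middle" "3_back"

seat3 <- seat  # copy, so the original is left unchanged
levels(seat3) <- c("Front", "Middle", "Back")  # must match the order above

as.character(seat3)
```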

Numerical to Factor With `cut()`

heightClass <- cut(
  m111survey$height,
  breaks = c(-Inf, 65, 70, Inf),
  labels = c("Short", "Medium","Tall"),
  right = FALSE
)
str(heightClass)

 Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 3 3 1 2 ...

right = FALSE means, e.g., 65 is “Medium”, not “Short”
right = TRUE means, e.g., 65 is “Short”, not “Medium”
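The effect of `right` is easiest to see on values that sit exactly on a break. The sketch below uses a few hypothetical heights; the names `heights`, `brks`, and `labs` are assumptions for this example.

```r
heights <- c(64, 65, 70, 72)  # hypothetical heights, two of them on a break
brks <- c(-Inf, 65, 70, Inf)
labs <- c("Short", "Medium", "Tall")

# Intervals closed on the left: [-Inf, 65), [65, 70), [70, Inf)
leftClosed <- cut(heights, breaks = brks, labels = labs, right = FALSE)
as.character(leftClosed)   # "Short" "Medium" "Tall" "Tall"

# Intervals closed on the right: (-Inf, 65], (65, 70], (70, Inf]
rightClosed <- cut(heights, breaks = brks, labels = labs, right = TRUE)
as.character(rightClosed)  # "Short" "Short" "Medium" "Tall"
```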

Getting Rid of a Variable

If you don’t want a variable in the data frame, get rid of it like this:

m111survey$seat3 <- NULL
