Data Frames

(Sections 7.4 - 7.6)

Tidy Data

A Dataset

Companions on the Yellow Brick Road
name        age  desire   likesDogs
Dorothy   12.00  Kansas   TRUE
Scarecrow  0.02  brains   TRUE
Tinman    20.00  heart    FALSE
Lion      18.00  courage  FALSE
Toto       6.00  kibbles  NA

Tidy Data

A data set is tidy when:

  • Each row corresponds to a unique entity of the same sort (a case).
  • In each column, the values record the same type of information about each case.

Definitions

Case

An individual unit under study. In a data frame in R, the rows correspond to cases.


Variable (in Data Analysis)

A measurement made on the individuals in a study.

Storing Data Sets in R

Can you use a matrix to store a data set?

No!

Matrices are atomic vectors: they can handle only one type of data at a time.

A data set can have many types of data.
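
A quick way to see the problem is to try it. The sketch below uses a made-up three-row data set (the vectors name and age are only for illustration) and binds a character column and a numeric column into a matrix:

```r
# Hypothetical mini data set: two columns of different types
name <- c("Dorothy", "Lion", "Toto")
age <- c(12, 18, 6)

# cbind() builds a matrix, so every value is coerced to a single type
m <- cbind(name, age)

typeof(m)     # "character": the ages have been turned into strings
m[, "age"]    # "12" "18" "6", no longer usable as numbers
```

Because the ages are now strings, a call such as mean(m[, "age"]) would not give a numerical answer.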

Several Types of Data …

Companions on the Yellow Brick Road
name        age  desire   likesDogs
Dorothy   12.00  Kansas   TRUE
Scarecrow  0.02  brains   TRUE
Tinman    20.00  heart    FALSE
Lion      18.00  courage  FALSE
Toto       6.00  kibbles  NA

It Would be Great If …

… there were a non-atomic two-dimensional data structure (so that the columns do not all have to be of the same data type).

Then:

  • name and desire could be character vectors
  • age could be double
  • likesDogs could be logical

Data Frames

Definition of “Data Frame”

Data Frame

A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.

Example Data Frame: m111survey

It’s in the bcscr package:

library(bcscr)

In the RStudio IDE, we can get a look at it …

View(m111survey)

… and in R we can learn more about it:

?m111survey

Other Ways to See a Data Frame

Print it all out to the Console:

m111survey

But that’s unwieldy. To see just the first few rows, try:

head(m111survey, n = 4)
  height ideal_ht sleep fastest   weight_feel love_first extra_life     seat
1     76       78   9.5     119 1_underweight         no        yes  1_front
2     74       76   7.0     110 2_about_right         no        yes 2_middle
3     64       NA   9.0      85 2_about_right         no         no 2_middle
4     62       65   7.0     100 1_underweight         no         no  1_front
   GPA enough_Sleep    sex diff.ideal.act.
1 3.56           no   male               2
2 2.50           no   male               2
3 3.80           no female              NA
4 3.50           no female               3

The Structure of a Data Frame

str(m111survey)
'data.frame':   71 obs. of  12 variables:
 $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...
 $ ideal_ht       : num  78 76 NA 65 72 NA 72 76 61 67 ...
 $ sleep          : num  9.5 7 9 7 8 10 4 6 7 7 ...
 $ fastest        : int  119 110 85 100 95 100 85 160 90 90 ...
 $ weight_feel    : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
 $ love_first     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ extra_life     : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
 $ seat           : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
 $ GPA            : num  3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
 $ enough_Sleep   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
 $ sex            : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
 $ diff.ideal.act.: num  2 2 NA 3 0 NA 2 -3 2 0 ...

Get the Variables

names(m111survey)
 [1] "height"          "ideal_ht"        "sleep"           "fastest"        
 [5] "weight_feel"     "love_first"      "extra_life"      "seat"           
 [9] "GPA"             "enough_Sleep"    "sex"             "diff.ideal.act."

You can use a $ to isolate a variable in a frame:

m111survey$fastest

The result is a vector and may be treated as such:

m111survey$fastest[1:10]
 [1] 119 110  85 100  95 100  85 160  90  90

Computing with a Variable …

… is done as with any vector. For example, the mean of the fastest speeds at which our subjects drove their cars:

mean(m111survey$fastest, na.rm = TRUE)
[1] 105.9014

Speeds of at least 150 miles per hour:

m111survey$fastest[m111survey$fastest >= 150]
[1] 160 190
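
The na.rm = TRUE argument matters whenever a variable has missing values. Here is a small sketch with made-up speeds (not taken from m111survey) showing what happens with and without it:

```r
# Made-up speeds with one missing value (not from m111survey)
speeds <- c(119, 110, NA, 100)

mean(speeds)                 # NA: one missing value makes the mean unknown
mean(speeds, na.rm = TRUE)   # mean of the three recorded speeds

# Count the speeds of at least 110, ignoring the missing one
sum(speeds >= 110, na.rm = TRUE)   # 2
```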

Factors

Factor Variables

Some of the variables in m111survey are factors.

Example: seat

str(m111survey$seat)
 Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...

Factors are examples of categorical variables.

Definition

Categorical Variable (in Data Analysis)

A variable whose values cannot be expressed meaningfully by numbers.

Usually a factor has only a small number of possible values, called levels.

levels(m111survey$seat)
[1] "1_front"  "2_middle" "3_back"  

Making a Factor

Start with a vector:

ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

Make a factor from it:

factorFavs <- factor(ozFavs)
factorFavs
[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Dorothy Glinda Scarecrow Toto

Determining the Levels

You can set the order of the levels yourself:

factorFavs2 <- factor(ozFavs,
                      levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))
factorFavs2
[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Toto Scarecrow Glinda Dorothy
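
One place where the order of the levels shows up is in tabulation. The following sketch reuses the ozFavs vector from above to illustrate this, and also peeks at the integer codes that a factor stores internally:

```r
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

# Default: the levels are the sorted unique values
table(factor(ozFavs))

# Custom order: the table follows the order given in levels
factorFavs2 <- factor(ozFavs,
                      levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))
table(factorFavs2)

# Internally, a factor stores integer codes that index into its levels
as.integer(factorFavs2)[1:3]   # 3 1 1 (Glinda, Toto, Toto)
```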

Caution

A categorical variable should not be made into a factor if it has a very large number of possible values.

Suppose your cases are people. The following are OK as factors:

  • sex (male, female, other)
  • favorite sport

Bad as factors:

  • street address
  • favorite quotation

Make it Not a Factor

To convert a factor to a non-factor:

as.character(factorFavs)
[1] "Glinda"    "Toto"      "Toto"      "Dorothy"   "Toto"      "Glinda"   
[7] "Scarecrow" "Dorothy"  
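
Going through as.character() is especially important when the labels of a factor look like numbers. A minimal sketch with a made-up factor f:

```r
# A factor whose labels look like numbers (made-up data)
f <- factor(c("10", "20", "10", "30"))

as.integer(f)                 # 1 2 1 3: the internal codes, not the values
as.numeric(as.character(f))   # 10 20 10 30: the values themselves
```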

Creating Data Frames

Two Ways to Make a Data Frame

  1. Create it directly from vectors. (We will all learn this.)
  2. Bind two or more existing data frames together. (Optional, see “More in Depth” section.)
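
For the curious, here is a rough sketch of the second approach, using rbind() to stack two small frames that have the same columns. The frames part1 and part2 are made up for illustration:

```r
# Two made-up frames with the same column names
part1 <- data.frame(name = c("Dorothy", "Lion"), age = c(12, 18))
part2 <- data.frame(name = "Toto", age = 6)

# rbind() stacks the rows of the second frame under the first
both <- rbind(part1, part2)
nrow(both)   # 3
```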

Direct Creation: data.frame()

n <- c("Dorothy", "Lion", "Scarecrow")
h <- c(58, 75, 69)
a <- c(12, 18, 0.04)
ozFolk <- data.frame(name = n, height = h, age = a)
ozFolk
       name height   age
1   Dorothy     58 12.00
2      Lion     75 18.00
3 Scarecrow     69  0.04
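
To confirm that each column kept its own type, you can inspect the classes of the columns. The sketch below rebuilds the same frame and sets stringsAsFactors = FALSE explicitly, so that the result does not depend on the R version (before R 4.0, data.frame() converted strings to factors by default):

```r
# Rebuild the small frame, keeping strings as character vectors
ozFolk <- data.frame(
  name = c("Dorothy", "Lion", "Scarecrow"),
  height = c(58, 75, 69),
  age = c(12, 18, 0.04),
  stringsAsFactors = FALSE
)

sapply(ozFolk, class)   # name is character; height and age are numeric
dim(ozFolk)             # 3 rows, 3 columns
```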

Subsetting with Data Frames

Subsetting with Data Frames …

… works as with matrices.

# all the rows, only two variables:
df <- m111survey[, c("height", "ideal_ht")]
head(df)
  height ideal_ht
1   76.0       78
2   74.0       76
3   64.0       NA
4   62.0       65
5   72.0       72
6   70.8       NA

“Dropping”

When you ask for just one column, the result is a vector:

df <- m111survey[, "height"]
str(df)
 num [1:71] 76 74 64 62 72 70.8 70 79 59 67 ...

Unless you set drop to FALSE:

df <- m111survey[, "height", drop = FALSE]
str(df)
'data.frame':   71 obs. of  1 variable:
 $ height: num  76 74 64 62 72 70.8 70 79 59 67 ...
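
The different ways of asking for a single column can be compared side by side. This sketch uses a tiny made-up frame rather than m111survey:

```r
# Made-up frame for illustration
df <- data.frame(height = c(76, 74, 64),
                 sex = c("male", "male", "female"))

is.data.frame(df[, "height"])                 # FALSE: dropped to a vector
is.data.frame(df[, "height", drop = FALSE])   # TRUE: still a data frame
is.data.frame(df["height"])                   # TRUE: one-index form keeps the frame
identical(df[["height"]], df$height)          # TRUE: both give the vector
```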

Another Example

You can select rows, too:

# rows 10-15, only two columns:
m111survey[10:15, c("height", "ideal_ht")]
   height ideal_ht
10     67       67
11     65       69
12     62       62
13     59       62
14     78       75
15     69       72

Selecting Rows at Random

n <- nrow(m111survey)
# six random rows:
df <- m111survey[sample(1:n, size = 6, replace = FALSE), ]
df[c("sex", "seat")]  # show just two columns
      sex     seat
13 female  1_front
54   male 2_middle
56   male   3_back
28 female  1_front
53 female   3_back
46 female 2_middle
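
Each run of sample() gives different rows. If you need the same "random" rows every time (say, so that a report is reproducible), one option is to call set.seed() first. A sketch with a made-up frame and an arbitrary seed:

```r
# Made-up data frame with ten rows
df <- data.frame(id = 1:10, letter = letters[1:10])

set.seed(2024)   # any fixed seed makes the draw repeatable
first <- df[sample(nrow(df), size = 3), ]

set.seed(2024)   # same seed again
second <- df[sample(nrow(df), size = 3), ]

identical(first, second)   # TRUE: same seed, same rows
```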

Subsetting With Boolean Expressions

# select only the rows where fastest is at least 150:
df <- m111survey[m111survey$fastest >= 150, ]
df[, c("sex", "fastest")]  # show just two of the variables
    sex fastest
8  male     160
32 male     190
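
One thing to watch for: if the variable in the condition has missing values, the bracket form returns extra rows filled with NA. The sketch below shows this with a made-up frame and one possible workaround, which():

```r
# Made-up frame with one missing speed
df <- data.frame(
  sex = c("male", "female", "male", "female"),
  fastest = c(160, NA, 190, 100)
)

# The NA in the condition produces a row filled with NAs
withNA <- df[df$fastest >= 150, ]
nrow(withNA)    # 3

# which() keeps only the positions where the condition is TRUE
clean <- df[which(df$fastest >= 150), ]
nrow(clean)     # 2
```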

Subsetting with subset()

subset() has three important parameters:

  • x: the data frame to pick from;
  • subset: a boolean expression to pick rows;
  • select: the columns you want (default is all of them)

Example

subset(
  x = m111survey, 
  subset = fastest >= 150,
  select = c("sex", "fastest")
  )
    sex fastest
8  male     160
32 male     190

Abbreviation

Most people don’t name the first two arguments:

subset(m111survey, fastest >= 150, select = c("sex", "fastest"))
    sex fastest
8  male     160
32 male     190

It Can Become Quite Complex!

df <- subset(
  m111survey,
  seat == "3_back" & height < 72 & sex == "female",
  select = c("sex", "height", "seat")
  )
df
      sex height   seat
9  female     59 3_back
20 female     65 3_back
30 female     69 3_back
53 female     69 3_back
70 female     65 3_back
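
For comparison, here is roughly how a subset() call translates to bracket notation. The frame is made up; note that subset() leaves out rows where the condition is NA, so the bracket version uses which() to match:

```r
# Made-up frame; the last height is missing
df <- data.frame(
  sex = c("female", "male", "female", "female"),
  height = c(59, 70, 75, NA),
  seat = c("3_back", "3_back", "3_back", "3_back")
)

# With subset(): variable names are looked up inside df
a <- subset(df, seat == "3_back" & height < 72 & sex == "female",
            select = c("sex", "height"))

# With brackets: every variable needs the df$ prefix
keep <- which(df$seat == "3_back" & df$height < 72 & df$sex == "female")
b <- df[keep, c("sex", "height")]

a   # only row 1 qualifies
b   # the same row
```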

New Variables from Old

Transforming a Variable

Height in feet, instead of inches:

heightInFeet <- m111survey$height / 12 # 12 inches in a foot

You can add the new variable to the frame:

m111survey$height_ft <- heightInFeet

Recoding

When you are interested only in whether or not a variable has a particular value, ifelse() can help:

seat2 <- ifelse(m111survey$seat == "3_back", "Back", "Other")
m111survey$seat2 <- seat2
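
ifelse() works element by element, and a missing value in the test stays missing in the result. A small sketch on a made-up vector:

```r
# Made-up seating preferences, one of them missing
seat <- c("1_front", "3_back", "2_middle", NA)

recoded <- ifelse(seat == "3_back", "Back", "Other")
recoded   # "Other" "Back" "Other" NA
```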

Recoding (3 or More Values)

Use mapvalues() in the plyr package:

seat3 <- plyr::mapvalues(
  m111survey$seat,
  from = c("1_front", "2_middle", "3_back"),
  to = c("Front", "Middle", "Back")
)
str(seat3)
 Factor w/ 3 levels "Front","Middle",..: 1 2 2 1 3 1 1 3 3 2 ...

You can save this back to the data frame if you like:

m111survey$seat3 <- seat3
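
If you would rather not depend on plyr, a similar recoding can be sketched in base R with the levels and labels arguments of factor(). This is an alternative to the approach above, shown on a made-up factor:

```r
# Made-up factor with the same levels as m111survey$seat
seat <- factor(c("1_front", "2_middle", "2_middle", "3_back"))

# Pair each old level with its new label, in the same order
seat3 <- factor(
  seat,
  levels = c("1_front", "2_middle", "3_back"),
  labels = c("Front", "Middle", "Back")
)

levels(seat3)         # "Front" "Middle" "Back"
as.character(seat3)   # "Front" "Middle" "Middle" "Back"
```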

Numerical to Factor With cut()

heightClass <- cut(
  m111survey$height,
  breaks = c(-Inf, 65, 70, Inf),
  labels = c("Short", "Medium","Tall"),
  right = FALSE
)
str(heightClass)
 Factor w/ 3 levels "Short","Medium",..: 3 3 1 1 3 3 3 3 1 2 ...
  • right = FALSE means, e.g., 65 is “Medium”, not “Short”
  • right = TRUE means, e.g., 65 is “Short”, not “Medium”
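
The effect of right is easiest to see on values that sit exactly on a break. A sketch with four made-up heights:

```r
# Made-up heights, two of them exactly on a break
h <- c(64, 65, 69.9, 70)
breaks <- c(-Inf, 65, 70, Inf)
labs <- c("Short", "Medium", "Tall")

# right = FALSE: intervals are closed on the left, e.g. [65, 70)
leftClosed <- as.character(cut(h, breaks = breaks, labels = labs, right = FALSE))
leftClosed    # "Short" "Medium" "Medium" "Tall"

# right = TRUE: intervals are closed on the right, e.g. (65, 70]
rightClosed <- as.character(cut(h, breaks = breaks, labels = labs, right = TRUE))
rightClosed   # "Short" "Short" "Medium" "Medium"
```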

Getting Rid of a Variable

If you don’t want a variable in the data frame, get rid of it like this:

m111survey$seat3 <- NULL
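
The same idea on a made-up frame, together with one alternative that selects every column except the unwanted one:

```r
# Made-up frame with three columns
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Assigning NULL removes the column
df$c <- NULL
names(df)   # "a" "b"

# Alternative: keep every column whose name is not "b"
df <- df[, names(df) != "b", drop = FALSE]
names(df)   # "a"
```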