Data Frames

(Sections 7.4 - 7.6)

Tidy Data

A Dataset

Companions on the Yellow Brick Road
name       age    desire   likesDogs
Dorothy    12.00  Kansas   TRUE
Scarecrow   0.02  brains   TRUE
Tinman     20.00  heart    FALSE
Lion       18.00  courage  FALSE
Toto        6.00  kibbles  NA

Tidy Data

A data set is tidy when:

  • Each row corresponds to a unique entity of the same sort (a case)
  • In each column the values record the same type of information about each case.

Definitions

Case

An individual unit under study. In a data frame in R, the rows correspond to cases.


Variable (in Data Analysis)

A measurement made on the individuals in a study.

Storing Data Sets in R

Can you use a matrix to store a data set?

No!

Matrices are atomic vectors: they can handle only one type of data at a time.

A data set can have many types of data.
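A small sketch of what goes wrong, assuming we try to bind two of the Oz columns into a matrix (the object name m is ours): R silently coerces everything to one type.

```r
# Bind a character vector and a numeric vector into a two-column matrix.
name <- c("Dorothy", "Scarecrow", "Tinman")
age  <- c(12, 0.02, 20)
m <- cbind(name, age)

# The matrix can hold only one type, so the ages became strings.
typeof(m)          # "character"
is.numeric(age)    # TRUE: the original vector is still numeric
m[[1, "age"]]      # the string "12", not the number 12
```

Arithmetic on the age column of m would now fail, which is why a matrix is a poor container for a data set.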

Several Types of Data …

Companions on the Yellow Brick Road
name       age    desire   likesDogs
Dorothy    12.00  Kansas   TRUE
Scarecrow   0.02  brains   TRUE
Tinman     20.00  heart    FALSE
Lion       18.00  courage  FALSE
Toto        6.00  kibbles  NA

It Would be Great If …

… there were a non-atomic, two-dimensional data structure (so that the columns do not all have to be of the same data type).

Then:

  • name and desire could be character vectors
  • age could be double
  • likesDogs could be logical

Data Frames

Definition of “Data Frame”

Data Frame

A two-dimensional data structure in R in which the columns are atomic vectors that can be of different types.

Example Data Frame: m111survey

It’s in the bcscr package:

library(bcscr)

In the RStudio IDE, we can get a look at it …

View(m111survey)

… and in R we can learn more about it:

Other Ways to See a Data Frame

Print it all out to the Console:

But that’s unwieldy. To see just the first few rows, try:
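A minimal sketch of both approaches. Because m111survey needs the bcscr package, this example uses a hand-built copy of the Oz table from earlier (the name oz is ours); the same calls should work on m111survey.

```r
# Hand-built copy of the Oz table, used here as a stand-in.
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

print(oz)         # the whole frame goes to the Console
head(oz, n = 3)   # only the first three rows
```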

The Structure of a Data Frame
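One way to inspect the structure is str(), which lists each column with its type and first few values. A sketch on a hand-built copy of the Oz table (the name oz is ours, standing in for m111survey):

```r
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

str(oz)              # compact summary: one line per column
dim(oz)              # number of rows, then number of columns
sapply(oz, class)    # the class of each column
```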

Get the Variables

You can use a $ to isolate a variable in a frame:

The result is a vector and may be treated as such:
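For instance, on a hand-built copy of the Oz table (the names oz and ages are ours), the extracted column behaves like any other numeric vector:

```r
oz <- data.frame(
  name = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age  = c(12, 0.02, 20, 18, 6),
  stringsAsFactors = FALSE
)

ages <- oz$age        # isolate one variable
is.data.frame(ages)   # FALSE: the result is no longer a frame
is.numeric(ages)      # TRUE: it is an ordinary numeric vector
ages[2:3]             # sub-set it like any vector
sort(ages)            # or pass it to any vector function
```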

Computing with a Variable …

… is done as with any vector. For example, the mean of the fastest speeds at which our subjects have driven their cars:

Speeds at least 150 miles per hour:
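A sketch of both computations. The frame drivers and its speeds are made up for illustration; with bcscr loaded, m111survey$fastest would take the place of drivers$fastest.

```r
# Hypothetical data: fastest speed ever driven, in miles per hour.
drivers <- data.frame(fastest = c(95, 110, 160, 150, 85))

# Mean of the variable, computed as for any numeric vector.
mean(drivers$fastest)

# Keep only the speeds of at least 150 miles per hour.
fast <- drivers$fastest[drivers$fastest >= 150]
fast
```

If a variable contains missing values, mean() returns NA unless you set na.rm = TRUE.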

Factors

Factor Variables

Some of the variables in m111survey are factors.

Example: seat

Factors are examples of categorical variables.

Definition

Categorical Variable (in Data Analysis)

A variable whose values cannot be expressed meaningfully by numbers.

Usually a factor has only a small number of possible values, called levels.

Making a Factor

Start with a vector:

ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

Make a factor from it:

factorFavs <- factor(ozFavs)
factorFavs
[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Dorothy Glinda Scarecrow Toto

Determining the Levels

You can set the order of the levels yourself:

factorFavs2 <- factor(
  ozFavs,
  levels = c("Toto", "Scarecrow", "Glinda", "Dorothy")
)
factorFavs2
[1] Glinda    Toto      Toto      Dorothy   Toto      Glinda    Scarecrow
[8] Dorothy  
Levels: Toto Scarecrow Glinda Dorothy
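To see what the levels argument changes, here is a short sketch (the names f1 and f2 are ours). The values stay the same; only the order of the levels, and so the integer codes behind them, differ.

```r
ozFavs <- c("Glinda", "Toto", "Toto", "Dorothy", "Toto",
            "Glinda", "Scarecrow", "Dorothy")

f1 <- factor(ozFavs)   # default level order
f2 <- factor(ozFavs,
             levels = c("Toto", "Scarecrow", "Glinda", "Dorothy"))

levels(f2)       # the order we asked for
nlevels(f2)      # how many levels there are
as.integer(f2)   # each value stored as a position in levels(f2)
table(f2)        # counts are reported in level order

# The values themselves are unchanged by reordering the levels.
identical(as.character(f1), as.character(f2))
```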

Caution

A categorical variable should not be made into a factor if it has a very large number of possible values.

Suppose your cases are people. The following are OK as factors:

  • sex (male, female, other)
  • favorite sport

Bad as factors:

  • street address
  • favorite quotation

Make it Not a Factor

To convert a factor to a non-factor:

as.character(factorFavs)
[1] "Glinda"    "Toto"      "Toto"      "Dorothy"   "Toto"      "Glinda"   
[7] "Scarecrow" "Dorothy"  

Creating Data Frames

Two Ways to Make a Data Frame

  1. Create it directly from vectors. (We will all learn this.)
  2. Bind two or more existing data frames together. (Optional, see “More in Depth” section.)

Direct Creation: data.frame()
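A minimal sketch of direct creation, building the Oz table from the start of these slides (the name oz is ours). Each argument to data.frame() is a vector that becomes one column, and all the vectors must have the same length.

```r
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  # Before R 4.0 the default turned strings into factors, so be explicit.
  stringsAsFactors = FALSE
)

oz
nrow(oz)   # number of cases
names(oz)  # variable names
```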

Subsetting with Data Frames

Subsetting with Data Frames …

… works as with matrices.

“Dropping”

When you ask for just one column, the result is a vector:

Unless you set drop to FALSE:
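A sketch of the difference, using a hand-built copy of part of the Oz table (the names oz, asVector and asFrame are ours):

```r
oz <- data.frame(
  name = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age  = c(12, 0.02, 20, 18, 6),
  stringsAsFactors = FALSE
)

asVector <- oz[, "age"]                # one column: dropped to a vector
asFrame  <- oz[, "age", drop = FALSE]  # still a one-column data frame

is.data.frame(asVector)   # FALSE
is.data.frame(asFrame)    # TRUE
```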

Another Example

You can select rows, too:
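For example, on a hand-built copy of the Oz table (the names oz, middle and ends are ours), row indices go before the comma and column choices after it:

```r
oz <- data.frame(
  name   = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age    = c(12, 0.02, 20, 18, 6),
  desire = c("Kansas", "brains", "heart", "courage", "kibbles"),
  stringsAsFactors = FALSE
)

middle <- oz[2:3, ]                          # rows 2 and 3, all columns
ends   <- oz[c(1, 5), c("name", "desire")]   # rows 1 and 5, two columns
middle
ends
```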

Selecting Rows at Random
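One common approach, sketched here on a hand-built copy of the Oz table (the names oz and picked are ours), is to sample row numbers with sample() and use them as row indices:

```r
oz <- data.frame(
  name = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age  = c(12, 0.02, 20, 18, 6),
  stringsAsFactors = FALSE
)

set.seed(2024)                         # only to make the draw repeatable
picked <- sample(nrow(oz), size = 3)   # three distinct row numbers
oz[picked, ]                           # the randomly chosen rows
```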

Subsetting With Boolean Expressions
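A sketch on a hand-built copy of the Oz table (the names oz, grownUps, others and dogFans are ours). A logical vector in the row position keeps the rows where it is TRUE.

```r
oz <- data.frame(
  name      = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age       = c(12, 0.02, 20, 18, 6),
  desire    = c("Kansas", "brains", "heart", "courage", "kibbles"),
  likesDogs = c(TRUE, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Rows where age is at least 18.
grownUps <- oz[oz$age >= 18, ]

# Conditions can be combined with & (and) or | (or).
others <- oz[oz$age < 18 & oz$desire != "Kansas", c("name", "age")]

# An NA in the condition would produce a row of NAs; which() skips it.
dogFans <- oz[which(oz$likesDogs), ]
```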

Subsetting with subset()

subset() has three important parameters:

  • x: the data frame to pick from;
  • subset: a boolean expression to pick rows;
  • select: the columns you want (default is all of them)

Example
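A sketch with all three parameters named, on a hand-built copy of the Oz table (the names oz and older are ours). Inside subset() the columns can be written without the oz$ prefix.

```r
oz <- data.frame(
  name   = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age    = c(12, 0.02, 20, 18, 6),
  desire = c("Kansas", "brains", "heart", "courage", "kibbles"),
  stringsAsFactors = FALSE
)

older <- subset(
  x      = oz,                 # the data frame to pick from
  subset = age >= 10,          # keep rows where this is TRUE
  select = c(name, desire)     # keep only these columns
)
older
```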

Abbreviation

Most people don’t name the first two arguments:
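A sketch comparing the two spellings on a hand-built copy of part of the Oz table (the names oz, named and short are ours). The first two arguments are matched by position, so the results should be the same.

```r
oz <- data.frame(
  name   = c("Dorothy", "Scarecrow", "Tinman", "Lion", "Toto"),
  age    = c(12, 0.02, 20, 18, 6),
  desire = c("Kansas", "brains", "heart", "courage", "kibbles"),
  stringsAsFactors = FALSE
)

named <- subset(x = oz, subset = age >= 10, select = c(name, desire))
short <- subset(oz, age >= 10, select = c(name, desire))

identical(named, short)   # TRUE: only the spelling differs
```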

It Can Become Quite Complex!

New Variables from Old

Transforming a Variable

Height in feet, instead of inches:

You can add the new variable to the frame:
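A sketch of both steps. The frame people, its heights, and the column name height_ft are made up for illustration; in m111survey the corresponding variable would be the one recording height in inches.

```r
# Hypothetical data: heights in inches.
people <- data.frame(height = c(60, 66, 72, 69))

# Transform: arithmetic applies to the whole vector at once.
heightFeet <- people$height / 12
heightFeet

# Assigning to a name that does not exist yet adds a new column.
people$height_ft <- heightFeet
names(people)
```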

Recoding

When you are interested only in whether or not a variable has a particular value, ifelse() can help:
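A sketch with a made-up seating variable (the vector seat and its values are hypothetical). ifelse() checks the condition element by element and picks one of two replacement values for each.

```r
# Hypothetical data: where each person prefers to sit.
seat <- c("front", "middle", "back", "front", "middle")

# We care only about front versus everything else.
inFront <- ifelse(seat == "front", "front", "other")
inFront
table(inFront)
```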

Recoding (3 or More Values)

Use mapvalues() in the plyr package:

You can save this back to the data frame if you like:
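So that it runs without installing anything, the sketch below does not call the plyr function; it swaps in a plain base-R technique, a named lookup vector. The frame people, the column names, and the old and new labels are all made up for illustration.

```r
# Hypothetical data: a seating preference with three values.
people <- data.frame(
  seat = c("front", "middle", "back", "front"),
  stringsAsFactors = FALSE
)

# Names are the old values, elements are the new ones.
lookup <- c(front = "near", middle = "center", back = "far")

# Indexing by name recodes every value; unname() drops the labels.
recoded <- unname(lookup[people$seat])
recoded

# Save the recoded variable back to the frame as a new column.
people$seat3 <- recoded
```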

Numerical to Factor With cut()

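  • right = FALSE: each interval includes its left endpoint but not its right, so, e.g., 65 is “Medium”, not “Short”
  • right = TRUE (the default): each interval includes its right endpoint but not its left, so, e.g., 65 is “Short”, not “Medium”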
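A sketch of both settings. The heights and the break points 65 and 70 are assumptions chosen to match the labels above, and the object names are ours.

```r
# Hypothetical heights in inches; 65 and 70 sit exactly on a break.
height <- c(60, 65, 68, 70, 74)
breaks <- c(0, 65, 70, Inf)
labels <- c("Short", "Medium", "Tall")

# Intervals [0, 65), [65, 70), [70, Inf): 65 falls in "Medium".
leftClosed <- cut(height, breaks = breaks, labels = labels, right = FALSE)

# Intervals (0, 65], (65, 70], (70, Inf]: 65 falls in "Short".
rightClosed <- cut(height, breaks = breaks, labels = labels, right = TRUE)

data.frame(height, leftClosed, rightClosed)
```

Both results are factors whose levels are the supplied labels, in order.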

Getting Rid of a Variable

If you don’t want a variable in the data frame, get rid of it like this:

survey$seat3 <- NULL