Basic Tidyverse Concepts

The `tidyverse` “Package”

Attach It

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The Pipe Operator

The Pipe

%>% is from the magrittr package. The core tidyverse packages re-export it, so library(tidyverse) makes it available.

Usage:

# same as rep("hello", times = 4)
"hello" %>% rep(times = 4)

[1] "hello" "hello" "hello" "hello"

The Idea of the Pipe

The value returned by the first call becomes the first argument of the second call.

Another example:

# same as nrow(bcscr::m111survey)
bcscr::m111survey %>% nrow()

[1] 71
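
The payoff shows up when several calls are chained: the steps read left to right instead of inside out. A minimal sketch, assuming only that magrittr is installed (the vector `x` is made up for illustration):

```r
library(magrittr)

# Hypothetical input vector, chosen only for illustration
x <- c(1, 4, 9, 16)

# Nested form: read from the inside out
nested <- round(sqrt(sum(x)), 1)

# Piped form: each result becomes the first argument of the next call
piped <- x %>%
  sum() %>%
  sqrt() %>%
  round(1)

# Both forms compute the same value
identical(nested, piped)
```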

Piping to Another Argument

Use the . to pipe to an argument other than the first one:

# same as rep("hello", times = 4)
4 %>% rep("hello", times = .)

[1] "hello" "hello" "hello" "hello"

Another example:

# same as seq(1, 100, by = 4)[3]
seq(1, 100, by = 4) %>% .[3]

[1] 9

Practice

Rewrite the following call with the pipe operator, in three different ways:

R’s Native Pipe

R Now Has Its “Own” Pipe

Since version 4.1.0, R has its own pipe, built into the language itself. It looks like this: |>.

The placeholder is the underscore (_) instead of a period. It requires R 4.2.0 or later and must be passed to a named argument.

Example

letters |>
## letters is passed as the first argument of rep:
  rep(times = 2)

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
[39] "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

Another Example:

Yet Another Example

1:length(letters) |>
## the placeholder below passes the above vector
## to the second argument of rep:
  rep(letters, times = _)

  [1] "a" "b" "b" "c" "c" "c" "d" "d" "d" "d" "e" "e" "e" "e" "e" "f" "f" "f"
 [19] "f" "f" "f" "g" "g" "g" "g" "g" "g" "g" "h" "h" "h" "h" "h" "h" "h" "h"
 [37] "i" "i" "i" "i" "i" "i" "i" "i" "i" "j" "j" "j" "j" "j" "j" "j" "j" "j"
 [55] "j" "k" "k" "k" "k" "k" "k" "k" "k" "k" "k" "k" "l" "l" "l" "l" "l" "l"
 [73] "l" "l" "l" "l" "l" "l" "m" "m" "m" "m" "m" "m" "m" "m" "m" "m" "m" "m"
 [91] "m" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "o" "o" "o"
[109] "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "p" "p" "p" "p" "p" "p"
[127] "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "q" "q" "q" "q" "q" "q" "q" "q"
[145] "q" "q" "q" "q" "q" "q" "q" "q" "q" "r" "r" "r" "r" "r" "r" "r" "r" "r"
[163] "r" "r" "r" "r" "r" "r" "r" "r" "r" "s" "s" "s" "s" "s" "s" "s" "s" "s"
[181] "s" "s" "s" "s" "s" "s" "s" "s" "s" "s" "t" "t" "t" "t" "t" "t" "t" "t"
[199] "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "u" "u" "u" "u" "u" "u"
[217] "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "v" "v" "v"
[235] "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v"
[253] "v" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w"
[271] "w" "w" "w" "w" "w" "w" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x"
[289] "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "y" "y" "y" "y" "y" "y"
[307] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[325] "y" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z"
[343] "z" "z" "z" "z" "z" "z" "z" "z" "z"
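
A short side-by-side sketch of the two placeholders, assuming R 4.2.0 or later and that magrittr is installed. Where no named argument is available (indexing, for example), one possible workaround is to pipe into an anonymous function:

```r
library(magrittr)

# magrittr: the dot can stand in for any argument
a <- 4 %>% rep("hello", times = .)

# native pipe: the underscore must be supplied to a named argument
b <- 4 |> rep("hello", times = _)

identical(a, b)

# Workaround when there is no named argument to target:
# pipe into an anonymous function (the \(v) syntax needs R 4.1.0 or later)
third <- seq(1, 100, by = 4) |> (\(v) v[3])()
third
```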

But …

… we’ll mostly stick with %>%.

Tibbles

Data Frames

What sort of thing is bcscr::m111survey?

class(bcscr::m111survey)

[1] "data.frame"

Yep, it’s a data frame.

Tibbles

Tibbles are similar to data frames. You can always turn a data frame into a tibble:

survey <- as_tibble(bcscr::m111survey)
class(survey)

[1] "tbl_df"     "tbl"        "data.frame"

Conveniences

The printout is compact:

This is an advantage when you are dealing with large amounts of data.
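
A small sketch of two conveniences, assuming the tibble package is installed (it is part of the core tidyverse). The data frame `df` is made up for illustration. Printing a tibble shows only the first ten rows along with each column's type, and subsetting a single column with `[` still returns a tibble instead of dropping to a vector:

```r
library(tibble)

# Hypothetical 100-row data frame, for illustration only
df <- data.frame(id = 1:100, squared = (1:100)^2)
tb <- as_tibble(df)

# Prints the first 10 rows plus column types, not all 100 rows
tb

# A data frame drops to a bare vector here ...
class(df[, "id"])

# ... while a tibble stays a tibble
class(tb[, "id"])
```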

Subsetting with dplyr

The dplyr package is part of the tidyverse. It contains functions to manipulate data sets.

Pick Out Rows With `filter()`
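
A minimal sketch of `filter()`, assuming dplyr is installed. The tibble `drivers` is a made-up stand-in whose column names mimic the survey data; it is not the real data set.

```r
library(dplyr)

# Hypothetical mini data set; values are invented for illustration
drivers <- tibble(
  sex     = c("female", "male", "female", "male", "female"),
  height  = c(64, 71, 66, 74, 62),
  fastest = c(95, 130, 110, 140, 85)
)

# Keep only the rows where the condition is TRUE
fast <- drivers %>%
  filter(fastest > 100)

# Conditions separated by commas are combined with "and"
fast_tall <- drivers %>%
  filter(fastest > 100, height > 70)

fast
fast_tall
```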

Pick Out Columns With `select()`
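
A minimal sketch of `select()`, assuming dplyr is installed. The tibble `drivers` is again a made-up stand-in for the survey data.

```r
library(dplyr)

# Hypothetical mini data set; values are invented for illustration
drivers <- tibble(
  sex     = c("female", "male", "female", "male", "female"),
  height  = c(64, 71, 66, 74, 62),
  fastest = c(95, 130, 110, 140, 85)
)

# Keep only the named columns, in the order given
two_cols <- drivers %>%
  select(sex, fastest)

# Selection helpers such as starts_with() pick columns by name pattern
h_cols <- drivers %>%
  select(starts_with("h"))

names(two_cols)
names(h_cols)
```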

Chaining

filter() and select() are examples of data verbs: they operate on data tables.

A data verb:

takes a data table (data frame, tibble, etc.) as its first argument;
returns a data table.

Because the output of a data verb is itself a data table, data verbs can be chained easily.

Example of Chaining
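
A sketch of a two-verb chain, assuming dplyr is installed and using a made-up tibble `drivers` in place of the survey data. The table returned by `filter()` is piped straight into `select()`:

```r
library(dplyr)

# Hypothetical mini data set; values are invented for illustration
drivers <- tibble(
  sex     = c("female", "male", "female", "male", "female"),
  height  = c(64, 71, 66, 74, 62),
  fastest = c(95, 130, 110, 140, 85)
)

# First pick rows, then pick columns from the result
females <- drivers %>%
  filter(sex == "female") %>%
  select(height, fastest)

females
```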

Leaving Columns Out

Put - signs in front of the columns you don’t want:
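
A minimal sketch, assuming dplyr is installed; `drivers` is a made-up tibble used only for illustration:

```r
library(dplyr)

# Hypothetical mini data set; values are invented for illustration
drivers <- tibble(
  sex     = c("female", "male", "female", "male", "female"),
  height  = c(64, 71, 66, 74, 62),
  fastest = c(95, 130, 110, 140, 85)
)

# Drop one column, keep the rest
no_sex <- drivers %>%
  select(-sex)

# Drop several columns at once
only_fastest <- drivers %>%
  select(-c(sex, height))

names(no_sex)
names(only_fastest)
```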

Transforming Variables With `mutate()`

Classify Drivers

You’re a Daredevil if you go more than 125 mph:
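
A minimal sketch of this classification, assuming dplyr is installed. The tibble `drivers` is made up; only the column name `fastest` and the 125 mph cutoff come from the text.

```r
library(dplyr)

# Hypothetical mini data set; values are invented for illustration
drivers <- tibble(
  height  = c(64, 71, 66, 74, 62),
  fastest = c(95, 130, 110, 140, 85)
)

# mutate() adds a new column computed from existing ones;
# here a logical flag for top speeds above 125 mph
classified <- drivers %>%
  mutate(dareDevil = fastest > 125)

classified
```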

More Than One Variable Transformed

survey %>% 
  mutate(dareDevil = fastest > 125,
         height_ft = height / 12) %>% 
  ggplot(aes(x = dareDevil, y = height_ft)) +
  # make the boxplot's outlier points transparent so they are not drawn twice
  geom_boxplot(fill = "burlywood", outlier.alpha = 0) +
  geom_jitter(width = 0.2) +
  labs(
    x = "Whether person drives more than 125 mph",
    y = "height (ft)",
    title = "Daredevils aren't any taller than cautious people!"
  )

Grouping and Summarizing

Back to CPS85

Access the CPS85 data:

data(CPS85, package = "mosaicData")

Mean Wage for Each Sex
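
A minimal sketch of the group-then-summarize pattern, assuming dplyr is installed. To keep it self-contained, it uses a made-up tibble `wages` with `sex` and `wage` columns instead of the real CPS85 data.

```r
library(dplyr)

# Hypothetical stand-in for CPS85; values are invented for illustration
wages <- tibble(
  sex  = c("F", "F", "M", "M", "M"),
  wage = c(10, 6, 12, 8, 10)
)

# group_by() marks the groups; summarise() returns one row per group
mean_by_sex <- wages %>%
  group_by(sex) %>%
  summarise(meanWage = mean(wage))

mean_by_sex
```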

More Than One Summary
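
A sketch of computing several summaries in one `summarise()` call, assuming dplyr is installed. The tibble `wages` is a made-up stand-in for CPS85.

```r
library(dplyr)

# Hypothetical stand-in for CPS85; values are invented for illustration
wages <- tibble(
  sex  = c("F", "F", "M", "M", "M"),
  wage = c(10, 6, 12, 8, 10)
)

# Each named argument becomes one column of the result
stats_by_sex <- wages %>%
  group_by(sex) %>%
  summarise(
    n = n(),
    meanWage = mean(wage),
    medianWage = median(wage)
  )

stats_by_sex
```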

Five-Number Summary

# same as fivenum(CPS85$wage)
CPS85 %>% 
  .$wage %>% 
  fivenum()

[1]  1.00  5.25  7.78 11.25 44.50

Summary by Sex

CPS85 %>%
  group_by(sex) %>% 
  summarise(
    n = n(),
    min = fivenum(wage)[1],
    Q1 = fivenum(wage)[2],
    median = fivenum(wage)[3],
    Q3 = fivenum(wage)[4],
    max = fivenum(wage)[5]
  )

# A tibble: 2 × 7
  sex       n   min    Q1 median    Q3   max
  <fct> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
1 F       245  1.75  4.75   6.8     10  44.5
2 M       289  1     6      8.93    13  26.3

Grouping by More Than One Variable

CPS85 %>% 
  group_by(sector, sex) %>% 
  summarise(
    n = n(),
    min = fivenum(wage)[1],
    Q1 = fivenum(wage)[2],
    median = fivenum(wage)[3],
    Q3 = fivenum(wage)[4],
    max = fivenum(wage)[5]
  )

# A tibble: 15 × 8
# Groups:   sector [8]
   sector   sex       n   min    Q1 median    Q3   max
   <fct>    <fct> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
 1 clerical F        76  3     5.1    7     9.55 15.0 
 2 clerical M        21  3.35  6      7.69  9    12   
 3 const    M        20  3.75  7.15   9.75 11.8  15   
 4 manag    F        21  3.64  6.88  10    11.2  44.5 
 5 manag    M        34  1     8.8   14.0  18.2  26.3 
 6 manuf    F        24  3     4.36   4.9   6.05 18.5 
 7 manuf    M        44  3.35  6.58   8.94 11.2  22.2 
 8 other    F         6  3.75  4      5.62  6.88  8.93
 9 other    M        62  2.85  5.25   7.5  11.2  26   
10 prof     F        52  4.35  7.02  10    12.3  25.0 
11 prof     M        53  5     8     12    16.4  25.0 
12 sales    F        17  3.35  3.8    4.55  5.65 14.3 
13 sales    M        21  3.5   5.56   9.42 12.5  20.0 
14 service  F        49  1.75  3.75   5     8    13.1 
15 service  M        34  2.01  4.15   5.89  8.75 25

Saving Output

sexSector <-
  CPS85 %>% 
  group_by(sector, sex) %>% 
  summarise(
    n = n(),
    min = fivenum(wage)[1],
    Q1 = fivenum(wage)[2],
    median = fivenum(wage)[3],
    Q3 = fivenum(wage)[4],
    max = fivenum(wage)[5]
  )
class(sexSector)

[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

sexSector2 <-
  sexSector %>% 
  ungroup()
class(sexSector2)

[1] "tbl_df"     "tbl"        "data.frame"
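
An alternative to a separate `ungroup()` step is the `.groups` argument of `summarise()`. A minimal sketch, assuming dplyr 1.0.0 or later and a made-up tibble `wages` standing in for CPS85:

```r
library(dplyr)

# Hypothetical stand-in for CPS85; values are invented for illustration
wages <- tibble(
  sex    = c("F", "F", "M", "M", "M"),
  sector = c("prof", "sales", "prof", "sales", "sales"),
  wage   = c(10, 6, 12, 8, 10)
)

# .groups = "drop" returns an ungrouped tibble directly
dropped <- wages %>%
  group_by(sector, sex) %>%
  summarise(n = n(), .groups = "drop")

# .groups = "drop_last" keeps the result grouped by sector
kept <- wages %>%
  group_by(sector, sex) %>%
  summarise(n = n(), .groups = "drop_last")

class(dropped)
group_vars(kept)
```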

Table-Display Tips

R Markdown / Quarto

In R Markdown or Quarto documents:

very small tables (one or two rows) can be printed as is;
medium tables (3-15 rows, roughly) could be displayed with knitr::kable();
larger tables could be displayed with DT::datatable().

There are many more fun options for table-display; this page has a list.

A Very Small Table

CPS85 %>% 
  group_by(sex) %>% 
  summarize(meanWage = mean(wage),
            n = n())

# A tibble: 2 × 3
  sex   meanWage     n
  <fct>    <dbl> <int>
1 F         7.88   245
2 M         9.99   289

A Medium-Sized Table

CPS85 %>% 
  group_by(sector) %>% 
  summarize(
    meanWage = mean(wage),
    n = n()
  ) %>% 
  knitr::kable(
    caption = "Mean Wage, by Sector"
  )

Mean Wage, by Sector
sector	meanWage	n
clerical	7.422577	97
const	9.502000	20
manag	12.704000	55
manuf	8.036029	68
other	8.500588	68
prof	11.947429	105
sales	7.592632	38
service	6.537470	83

Large Tables

CPS85 %>% 
  group_by(sector) %>% 
  slice_max(order_by = wage, n = 10) %>% 
  arrange(sector, desc(wage)) %>% 
  DT::datatable(
    options = list(
      pageLength = 5,
      scrollX = TRUE
    )
  )

Here is the Table

DT Options

You can control many things:

CPS85 %>% 
  group_by(sector) %>% 
  slice_max(order_by = wage, n = 10) %>% 
  arrange(sector, desc(wage)) %>% 
  DT::datatable(
    options = list(
      pageLength = 5,
      lengthMenu = c(5, 10, 15, 20),
      scrollX = TRUE
    )
  )

See here to learn more.

Fun Option: `reactable()`

library(reactable)
CPS85 %>% 
  group_by(sector, sex) %>% 
  summarize(
    n = n(),
    median_wage = median(wage)
  ) %>% 
  reactable(
    groupBy = "sector",
    resizable = TRUE,
    pagination = FALSE,
    highlight = TRUE,
    height = 400
  )

Learn more about reactable here.

Fun Option: `gt()`

library(gt)
CPS85 %>% 
  group_by(sector, sex) %>% 
  summarize(
    n = n(),
    median_wage = median(wage)
  ) %>% 
  filter(
    sector %in% c(
      "manag", "manuf", "prof"
    )
  ) %>% 
  gt() %>%
  tab_header(
    title = "Wage by Sector and Sex",
    subtitle = "Current Population Survey, 1985"
  ) %>%
  fmt_currency(
    columns = c(median_wage),
    currency = "USD"
  )

Learn more about gt here.

Wage by Sector and Sex
Three sectors are shown.
sex	n	median_wage
manag
F	21	$10.00
M	34	$13.99
manuf
F	24	$4.90
M	44	$8.95
prof
F	52	$10.00
M	53	$12.00
From table `CPS85` in package mosaicData
