Basic Tidyverse Concepts

The tidyverse “Package”

Attach It

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors

The Pipe Operator

The Pipe

%>% comes from the magrittr package. The core tidyverse packages re-export it, so it is available once the tidyverse is attached.


# same as rep("hello", times = 4)
"hello" %>% rep(times = 4)
[1] "hello" "hello" "hello" "hello"

The Idea of the Pipe

The value returned by the first call becomes the first argument of the second call.

Another example:

# same as nrow(bcscr::m111survey)
bcscr::m111survey %>% nrow()
[1] 71
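
To make the rule above concrete, here is a minimal sketch that checks that a piped chain and the equivalent nested call return the same value. The vector x is made up for illustration; only base R functions and %>% are used.

```r
library(magrittr)  # provides %>%

x <- c(4, 9, 16)   # hypothetical data, for illustration only

# x %>% f() %>% g() is evaluated as g(f(x))
piped  <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))

identical(piped, nested)
```

Reading left to right, the piped version lists the steps in the order they happen, which is the main reason to prefer it for longer chains.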

Piping to Another Argument

Use the . to pipe to an argument other than the first one:

# same as rep("hello", times = 4)
4 %>% rep("hello", times = .)
[1] "hello" "hello" "hello" "hello"

Another example:

# same as seq(1, 100, by = 4)[3]
seq(1, 100, by = 4) %>% .[3]
[1] 9


Rewrite the following call with the pipe operator, in three different ways:

seq(2, 22, by = 4)
[1]  2  6 10 14 18 22

R’s Native Pipe

R Now Has Its “Own” Pipe

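Since version 4.1.0, base R has its own pipe. It looks like this: |>.

Its placeholder is the underscore (_) instead of a period. In a function call the placeholder must be given to a named argument (R 4.2.0 or later); it can also head an extraction such as _[1:5] (R 4.3.0 or later).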


letters |>
## letters is passed as the first argument of rep:
  rep(times = 2)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
[39] "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

Another Example:

letters |>
## extract the first five elements of letters:
  _[1:5] |>
## pass these as the first argument of rep:
  rep(times = 1:5)
 [1] "a" "b" "b" "c" "c" "c" "d" "d" "d" "d" "e" "e" "e" "e" "e"

Yet Another Example

1:length(letters) |>
## the placeholder below passes the above vector
## to the second argument of rep:
  rep(letters, times = _)
  [1] "a" "b" "b" "c" "c" "c" "d" "d" "d" "d" "e" "e" "e" "e" "e" "f" "f" "f"
 [19] "f" "f" "f" "g" "g" "g" "g" "g" "g" "g" "h" "h" "h" "h" "h" "h" "h" "h"
 [37] "i" "i" "i" "i" "i" "i" "i" "i" "i" "j" "j" "j" "j" "j" "j" "j" "j" "j"
 [55] "j" "k" "k" "k" "k" "k" "k" "k" "k" "k" "k" "k" "l" "l" "l" "l" "l" "l"
 [73] "l" "l" "l" "l" "l" "l" "m" "m" "m" "m" "m" "m" "m" "m" "m" "m" "m" "m"
 [91] "m" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "o" "o" "o"
[109] "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "o" "p" "p" "p" "p" "p" "p"
[127] "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "q" "q" "q" "q" "q" "q" "q" "q"
[145] "q" "q" "q" "q" "q" "q" "q" "q" "q" "r" "r" "r" "r" "r" "r" "r" "r" "r"
[163] "r" "r" "r" "r" "r" "r" "r" "r" "r" "s" "s" "s" "s" "s" "s" "s" "s" "s"
[181] "s" "s" "s" "s" "s" "s" "s" "s" "s" "s" "t" "t" "t" "t" "t" "t" "t" "t"
[199] "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "t" "u" "u" "u" "u" "u" "u"
[217] "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "u" "v" "v" "v"
[235] "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v" "v"
[253] "v" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w" "w"
[271] "w" "w" "w" "w" "w" "w" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x"
[289] "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "x" "y" "y" "y" "y" "y" "y"
[307] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[325] "y" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z" "z"
[343] "z" "z" "z" "z" "z" "z" "z" "z" "z"
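
As a side-by-side sketch, the two pipes can be compared on the earlier rep() example. This assumes R 4.2.0 or later, since the underscore placeholder does not exist in older versions.

```r
library(magrittr)

# magrittr: the dot may be used as a named or an unnamed argument
a <- 4 %>% rep("hello", times = .)

# native pipe: the underscore has to be passed to a *named* argument
b <- 4 |> rep("hello", times = _)

identical(a, b)
```

An unnamed use such as 4 |> rep("hello", _) is rejected by the parser, so with the native pipe the argument name is not optional.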

But …

… we’ll mostly stick with %>%.


Data Frames

What sort of thing is bcscr::m111survey?

class(bcscr::m111survey)
[1] "data.frame"

Yep, it’s a data frame.


Tibbles are similar to data frames. You can always turn a data frame into a tibble:

survey <- as_tibble(bcscr::m111survey)
class(survey)
[1] "tbl_df"     "tbl"        "data.frame"


The printout is compact:

survey
# A tibble: 71 × 12
   height ideal_ht sleep fastest weight_feel   love_first extra_life seat    GPA
    <dbl>    <dbl> <dbl>   <int> <fct>         <fct>      <fct>      <fct> <dbl>
 1   76         78   9.5     119 1_underweight no         yes        1_fr…  3.56
 2   74         76   7       110 2_about_right no         yes        2_mi…  2.5 
 3   64         NA   9        85 2_about_right no         no         2_mi…  3.8 
 4   62         65   7       100 1_underweight no         no         1_fr…  3.5 
 5   72         72   8        95 1_underweight no         yes        3_ba…  3.2 
 6   70.8       NA  10       100 3_overweight  no         no         1_fr…  3.1 
 7   70         72   4        85 2_about_right no         yes        1_fr…  3.68
 8   79         76   6       160 2_about_right no         yes        3_ba…  2.7 
 9   59         61   7        90 2_about_right no         yes        3_ba…  2.8 
10   67         67   7        90 3_overweight  no         no         2_mi… NA   
# ℹ 61 more rows
# ℹ 3 more variables: enough_Sleep <fct>, sex <fct>, diff.ideal.act. <dbl>

This is an advantage when you are dealing with a large amount of data: only the first ten rows, and only the columns that fit on the screen, are printed.
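
Printing is not the only difference. The sketch below, which uses a small made-up data frame rather than the survey data, illustrates one more: subsetting a single column with [ keeps a tibble a tibble, whereas a plain data frame drops down to a vector.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))  # made-up data
tb <- as_tibble(df)

# A tibble is still a data frame ...
is.data.frame(tb)

# ... but it does not simplify when one column is selected with [
class(df[, "x"])   # a bare integer vector
class(tb[, "x"])   # still a tibble
```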

Subsetting with dplyr

The dplyr package is part of the tidyverse. It contains functions to manipulate data sets.

Pick Out Rows With filter()

survey %>%
  filter((sex == "male" & height > 70) | (sex == "female" & height < 55))
# A tibble: 22 × 12
   height ideal_ht sleep fastest weight_feel   love_first extra_life seat    GPA
    <dbl>    <dbl> <dbl>   <int> <fct>         <fct>      <fct>      <fct> <dbl>
 1   76         78   9.5     119 1_underweight no         yes        1_fr…  3.56
 2   74         76   7       110 2_about_right no         yes        2_mi…  2.5 
 3   72         72   8        95 1_underweight no         yes        3_ba…  3.2 
 4   70.8       NA  10       100 3_overweight  no         no         1_fr…  3.1 
 5   79         76   6       160 2_about_right no         yes        3_ba…  2.7 
 6   73         77   6       110 2_about_right yes        yes        3_ba…  3.5 
 7   73         75   8       120 2_about_right no         yes        2_mi…  3.55
 8   54         54   4       130 3_overweight  yes        yes        1_fr…  3.41
 9   74         75   5       119 2_about_right yes        yes        1_fr…  3.7 
10   72         90   9       125 3_overweight  no         yes        3_ba…  2.2 
# ℹ 12 more rows
# ℹ 3 more variables: enough_Sleep <fct>, sex <fct>, diff.ideal.act. <dbl>
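
Two details of filter() are easy to miss: conditions separated by commas are combined with "and", and rows where the condition evaluates to NA are dropped. A minimal sketch on a made-up tibble (the people data below is hypothetical, not taken from the survey):

```r
library(dplyr)

# hypothetical data; the last height is missing on purpose
people <- tibble(
  sex    = c("male", "male", "female", "female", "male"),
  height = c(72, 68, 54, 66, NA)
)

# same logical structure as the survey example
picked <- people %>%
  filter((sex == "male" & height > 70) | (sex == "female" & height < 55))

# comma-separated conditions must all hold
tall_males <- people %>%
  filter(sex == "male", height > 70)

nrow(picked)
nrow(tall_males)
```

The male with the missing height appears in neither result, because the comparison height > 70 is NA for him.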

Pick Out Columns With select()

survey %>%
  select(sex, height, fastest)
# A tibble: 71 × 3
   sex    height fastest
   <fct>   <dbl>   <int>
 1 male     76       119
 2 male     74       110
 3 female   64        85
 4 female   62       100
 5 male     72        95
 6 male     70.8     100
 7 male     70        85
 8 male     79       160
 9 female   59        90
10 female   67        90
# ℹ 61 more rows


filter() and select() are examples of data verbs: they operate on data tables.

A data verb:

  • takes a data table (data frame, tibble, etc.) as its first argument;
  • returns a data table.

Hence they may be composed easily.
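
Any function that follows these two rules can be slotted into a pipeline, including one you write yourself. As a sketch, keep_fast() below is a hypothetical helper (it is not part of dplyr) that takes a data table and returns a data table:

```r
library(dplyr)

# hypothetical data verb: data table in, data table out
keep_fast <- function(data, cutoff = 100) {
  data %>% filter(fastest > cutoff)
}

# made-up data, for illustration only
drivers <- tibble(
  sex     = c("male", "female", "male"),
  fastest = c(119, 85, 160)
)

result <- drivers %>%
  keep_fast(cutoff = 110) %>%
  select(sex, fastest)
```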

Example of Chaining

survey %>% 
  filter((sex == "male" & height > 70) | (sex == "female" & height < 55)) %>% 
  select(sex, height, fastest)
# A tibble: 22 × 3
   sex    height fastest
   <fct>   <dbl>   <int>
 1 male     76       119
 2 male     74       110
 3 male     72        95
 4 male     70.8     100
 5 male     79       160
 6 male     73       110
 7 male     73       120
 8 female   54       130
 9 male     74       119
10 male     72       125
# ℹ 12 more rows

Leaving Columns Out

Put - signs in front of the columns you don’t want:

survey %>% 
  select(-ideal_ht, -love_first)
# A tibble: 71 × 10
   height sleep fastest weight_feel   extra_life seat     GPA enough_Sleep sex  
    <dbl> <dbl>   <int> <fct>         <fct>      <fct>  <dbl> <fct>        <fct>
 1   76     9.5     119 1_underweight yes        1_fro…  3.56 no           male 
 2   74     7       110 2_about_right yes        2_mid…  2.5  no           male 
 3   64     9        85 2_about_right no         2_mid…  3.8  no           fema…
 4   62     7       100 1_underweight no         1_fro…  3.5  no           fema…
 5   72     8        95 1_underweight yes        3_back  3.2  no           male 
 6   70.8  10       100 3_overweight  no         1_fro…  3.1  yes          male 
 7   70     4        85 2_about_right yes        1_fro…  3.68 no           male 
 8   79     6       160 2_about_right yes        3_back  2.7  yes          male 
 9   59     7        90 2_about_right yes        3_back  2.8  no           fema…
10   67     7        90 3_overweight  no         2_mid… NA    yes          fema…
# ℹ 61 more rows
# ℹ 1 more variable: diff.ideal.act. <dbl>
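
There are a few equivalent spellings for dropping columns. The sketch below, on a made-up tibble, checks that three of them give the same result.

```r
library(dplyr)

toy <- tibble(a = 1:2, b = 3:4, c = 5:6, d = 7:8)  # made-up data

v1 <- toy %>% select(-b, -d)      # one minus sign per column
v2 <- toy %>% select(-c(b, d))    # one minus sign for a group of columns
v3 <- toy %>% select(!c(b, d))    # "everything except" with !

identical(v1, v2) && identical(v2, v3)
```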

Transforming Variables With mutate()

Classify Drivers

You’re a Daredevil if you go more than 125 mph:

survey %>% 
  mutate(dareDevil = fastest > 125) %>%
  select(sex, fastest, dareDevil)
# A tibble: 71 × 3
   sex    fastest dareDevil
   <fct>    <int> <lgl>    
 1 male       119 FALSE    
 2 male       110 FALSE    
 3 female      85 FALSE    
 4 female     100 FALSE    
 5 male        95 FALSE    
 6 male       100 FALSE    
 7 male        85 FALSE    
 8 male       160 TRUE     
 9 female      90 FALSE    
10 female      90 FALSE    
# ℹ 61 more rows

More Than One Variable Transformed

survey %>% 
  mutate(dareDevil = fastest > 125,
         height_ft = height / 12) %>% 
  ggplot(aes(x = dareDevil, y = height_ft)) +
    # hide the boxplot's own outlier points; geom_jitter() draws every point
    geom_boxplot(fill = "burlywood", outlier.alpha = 0) +
    geom_jitter(width = 0.2) +
    labs(
      x = "Whether person drives more than 125 mph",
      y = "height (ft)",
      title = "Daredevils aren't any taller than cautious people!"
    )
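
Within a single mutate() call, a new variable may refer to variables created earlier in the same call. A minimal sketch on made-up data (the 5.5 ft cutoff and the name tall_daredevil are arbitrary choices for illustration):

```r
library(dplyr)

# made-up data, for illustration only
drivers <- tibble(
  height  = c(76, 64, 70.8),
  fastest = c(119, 85, 160)
)

out <- drivers %>%
  mutate(
    height_ft      = height / 12,                 # inches to feet
    dareDevil      = fastest > 125,
    tall_daredevil = dareDevil & height_ft > 5.5  # uses both new columns
  )
```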

Grouping and Summarizing

Back to CPS85

Access the CPS85 data:

data(CPS85, package = "mosaicData")

Mean Wage for Each Sex

CPS85 %>% 
  group_by(sex) %>% 
  summarize(meanWage = mean(wage))
# A tibble: 2 × 2
  sex   meanWage
  <fct>    <dbl>
1 F         7.88
2 M         9.99

More Than One Summary

CPS85 %>% 
  group_by(sex) %>% 
  summarize(meanWage = mean(wage),
            n = n())
# A tibble: 2 × 3
  sex   meanWage     n
  <fct>    <dbl> <int>
1 F         7.88   245
2 M         9.99   289
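
As an aside, dplyr 1.1.0 and later also accept a .by argument in summarize(), which groups for that one call only. The sketch below uses a small made-up wage table (not CPS85) to check that both spellings give the same numbers.

```r
library(dplyr)

# made-up data, for illustration only
wages <- tibble(
  sex  = c("F", "F", "M", "M", "M"),
  wage = c(6, 10, 8, 9, 13)
)

# persistent grouping, then summarizing
by_group <- wages %>%
  group_by(sex) %>%
  summarize(meanWage = mean(wage), n = n())

# per-call grouping with .by
by_arg <- wages %>%
  summarize(meanWage = mean(wage), n = n(), .by = sex)
```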

Five-Number Summary

# same as fivenum(CPS85$wage)
CPS85 %>% 
  .$wage %>% 
  fivenum()
[1]  1.00  5.25  7.78 11.25 44.50

Summary by Sex

CPS85 %>%
  group_by(sex) %>% 
  summarize(
    n = n(),
    min = fivenum(wage)[1],
    Q1 = fivenum(wage)[2],
    median = fivenum(wage)[3],
    Q3 = fivenum(wage)[4],
    max = fivenum(wage)[5]
  )
# A tibble: 2 × 7
  sex       n   min    Q1 median    Q3   max
  <fct> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
1 F       245  1.75  4.75   6.8     10  44.5
2 M       289  1     6      8.93    13  26.3
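
Calling fivenum() five times per group works, but it is repetitive. One possible alternative, sketched here on made-up data, is reframe() (dplyr 1.1.0 or later), which allows several rows per group and so yields the summary in long form. The column names stat and value are arbitrary choices.

```r
library(dplyr)

# made-up data, for illustration only
wages <- tibble(
  sex  = c("F", "F", "F", "M", "M", "M"),
  wage = c(4, 6, 11, 5, 9, 13)
)

long_summary <- wages %>%
  group_by(sex) %>%
  reframe(
    stat  = c("min", "Q1", "median", "Q3", "max"),
    value = fivenum(wage)   # computed once per group
  )
```

The result has five rows per group instead of one wide row, so it suits plotting or further filtering better than side-by-side display.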

Grouping by More Than One Variable

CPS85 %>% 
  group_by(sector, sex) %>% 
  summarize(
    n = n(),
    min = fivenum(wage)[1],
    Q1 = fivenum(wage)[2],
    median = fivenum(wage)[3],
    Q3 = fivenum(wage)[4],
    max = fivenum(wage)[5]
  )
# A tibble: 15 × 8
# Groups:   sector [8]
   sector   sex       n   min    Q1 median    Q3   max
   <fct>    <fct> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
 1 clerical F        76  3     5.1    7     9.55 15.0 
 2 clerical M        21  3.35  6      7.69  9    12   
 3 const    M        20  3.75  7.15   9.75 11.8  15   
 4 manag    F        21  3.64  6.88  10    11.2  44.5 
 5 manag    M        34  1     8.8   14.0  18.2  26.3 
 6 manuf    F        24  3     4.36   4.9   6.05 18.5 
 7 manuf    M        44  3.35  6.58   8.94 11.2  22.2 
 8 other    F         6  3.75  4      5.62  6.88  8.93
 9 other    M        62  2.85  5.25   7.5  11.2  26   
10 prof     F        52  4.35  7.02  10    12.3  25.0 
11 prof     M        53  5     8     12    16.4  25.0 
12 sales    F        17  3.35  3.8    4.55  5.65 14.3 
13 sales    M        21  3.5   5.56   9.42 12.5  20.0 
14 service  F        49  1.75  3.75   5     8    13.1 
15 service  M        34  2.01  4.15   5.89  8.75 25   

Saving Output

sexSector <-
  CPS85 %>% 
  group_by(sector, sex) %>% 
  summarize(
    n = n(),
    min = fivenum(wage)[1],
    Q1 = fivenum(wage)[2],
    median = fivenum(wage)[3],
    Q3 = fivenum(wage)[4],
    max = fivenum(wage)[5]
  )
class(sexSector)
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
sexSector2 <-
  sexSector %>% 
  ungroup()
class(sexSector2)
[1] "tbl_df"     "tbl"        "data.frame"
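
The leftover grouping comes from summarize() itself: when there are two grouping variables, it peels off only the last one. A sketch on made-up data, using the .groups argument to make either behavior explicit:

```r
library(dplyr)

# made-up data, for illustration only
wages <- tibble(
  sector = c("manag", "manag", "prof", "prof"),
  sex    = c("F", "M", "F", "M"),
  wage   = c(10, 14, 10, 12)
)

# what summarize() does here by default, stated explicitly
still_grouped <- wages %>%
  group_by(sector, sex) %>%
  summarize(n = n(), .groups = "drop_last")

# drop all grouping in the same step
ungrouped <- wages %>%
  group_by(sector, sex) %>%
  summarize(n = n(), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

Setting .groups = "drop" has the same effect as piping the result into ungroup() afterwards.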

Table-Display Tips

R Markdown / Quarto

In R Markdown or Quarto documents:

  • very small tables (one or two rows) can be printed as is;
  • medium tables (3-15 rows, roughly) could be displayed with knitr::kable();
  • larger tables could be displayed with DT::datatable().

There are many more fun options for table-display; this page has a list.

A Very Small Table

CPS85 %>% 
  group_by(sex) %>% 
  summarize(meanWage = mean(wage),
            n = n())
# A tibble: 2 × 3
  sex   meanWage     n
  <fct>    <dbl> <int>
1 F         7.88   245
2 M         9.99   289

A Medium-Sized Table

CPS85 %>% 
  group_by(sector) %>% 
  summarize(
    meanWage = mean(wage),
    n = n()
  ) %>% 
  knitr::kable(
    caption = "Mean Wage, by Sector"
  )
Mean Wage, by Sector

sector      meanWage     n
clerical    7.422577    97
const       9.502000    20
manag      12.704000    55
manuf       8.036029    68
other       8.500588    68
prof       11.947429   105
sales       7.592632    38
service     6.537470    83
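
Six decimal places is more than a wage table needs. knitr::kable() has a digits argument that rounds numeric columns; the sketch below applies it to a small made-up table (the wage values are invented for illustration).

```r
library(dplyr)

# made-up data, for illustration only
wages <- tibble(
  sector = c("sales", "sales", "prof"),
  wage   = c(7.4, 7.785264, 11.947429)
)

tab <- wages %>%
  group_by(sector) %>%
  summarize(meanWage = mean(wage), n = n()) %>%
  knitr::kable(
    digits  = 2,                       # round numeric columns to 2 places
    caption = "Mean Wage, by Sector"
  )

tab
```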

Large Tables

CPS85 %>% 
  group_by(sector) %>% 
  # slice_max() takes order_by, not wt
  slice_max(order_by = wage, n = 10) %>% 
  arrange(sector, desc(wage)) %>% 
  DT::datatable(
    options = list(
      pageLength = 5,
      scrollX = TRUE
    )
  )

Here is the Table

DT Options

You can control many things:

CPS85 %>% 
  group_by(sector) %>% 
  # slice_max() replaces the superseded top_n()
  slice_max(order_by = wage, n = 10) %>% 
  arrange(sector, desc(wage)) %>% 
  DT::datatable(
    options = list(
      pageLength = 5,
      lengthMenu = c(5, 10, 15, 20),
      scrollX = TRUE
    )
  )

See here to learn more.

Fun Option: reactable()

CPS85 %>% 
  group_by(sector, sex) %>% 
  summarize(
    n = n(),
    median_wage = median(wage)
  ) %>% 
  reactable::reactable(
    groupBy = "sector",
    resizable = TRUE,
    pagination = FALSE,
    highlight = TRUE,
    height = 400
  )

Learn more about reactable here.

Fun Option: gt()

library(gt)

CPS85 %>% 
  group_by(sector, sex) %>% 
  summarize(
    n = n(),
    median_wage = median(wage)
  ) %>% 
  filter(
    sector %in% c(
      "manag", "manuf", "prof"
    )
  ) %>% 
  gt() %>%
  tab_header(
    title = "Wage by Sector and Sex",
    subtitle = "Current Population Survey, 1985"
  ) %>%
  fmt_currency(
    columns = c(median_wage),
    currency = "USD"
  )

Learn more about gt here.

Wage by Sector and Sex
Three sectors are shown.

sex      n   median_wage
manag
F       21        $10.00
M       34        $13.99
manuf
F       24         $4.90
M       44         $8.95
prof
F       52        $10.00
M       53        $12.00

From table CPS85 in package mosaicData