Regular Expressions in R

(Section 12.4)

Packages

Make sure these are attached:

library(bcscr)
library(tidyverse)

String to Regex

Example: Splitting on a Space

"hello there Mary Poppins" %>% 
  str_split(pattern = " ") %>% 
  unlist()

[1] "hello"   "there"   "Mary"    "Poppins"

The pattern parameter in str_*() functions takes a string that is converted to a regular expression by R.
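As a small illustration of that conversion (a sketch, assuming only that stringr is installed): a bare string passed as pattern behaves the same as one wrapped explicitly in regex(), while fixed() asks for a literal match instead.

```r
library(stringr)

x <- "hello there Mary Poppins"

# A bare string pattern is interpreted as a regular expression ...
viaString <- str_split(x, pattern = " ")
# ... which is the same as wrapping it in regex() explicitly.
viaRegex <- str_split(x, pattern = regex(" "))

identical(viaString, viaRegex)

# fixed() switches off regex interpretation, so "." is a literal dot here.
dotPieces <- str_split("a.b.c", pattern = fixed("."))[[1]]
dotPieces
```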

Example: Splitting on Whitespace

myString <- "hello\t\tthere\n\nMary  \t Poppins"
cat(myString)

hello       there

Mary     Poppins

Try to split on the whitespace:

myString %>% 
  str_split(pattern = "\s+") %>% 
  unlist()

Error: '\s' is an unrecognized escape in character string (<input>:2:25)

The Problem

Before R can convert pattern to a regex, it has to understand pattern for what it is: a string!

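But \s is not an escape sequence that R strings recognize (unlike \t or \n), so R rejects the string before it can ever be treated as a regex!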

Solution

So to get a literal \s, we must escape the \:

myString %>% 
  str_split(pattern = "\\s+") %>% 
  unlist()

[1] "hello"   "there"   "Mary"    "Poppins"

Moral: Always escape again when the regex you intend uses a backslash for escaping!

String-to-Regex Examples

Regular expressions, and how each must be typed as a string in R:
Regular Expression	Entered as String
\s+	"\\s+"
find\.dot	"find\\.dot"
^\w*\d{1,3}$	"^\\w*\\d{1,3}$"
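One way to check entries like these is to print the string with writeLines(), which shows the characters the string really contains, that is, the regular expression the engine receives. The sketch below uses only base R; the raw-string line assumes R 4.0.0 or later.

```r
# Each string is written the way it must be typed in R source code.
patterns <- c("\\s+", "find\\.dot", "^\\w*\\d{1,3}$")

# writeLines() prints the actual characters, with single backslashes.
writeLines(patterns)

# The doubled backslash in the source is ONE character in the string.
nchar("\\s+")

# Assumption: R >= 4.0.0, where raw strings avoid the doubling.
rawVersion <- r"(\s+)"
identical(rawVersion, "\\s+")
```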

Substitution

Example: Messy Dates

Suppose that we have a vector of dates:

dates <- c(
  "3 - 14 - 1963", "4/13/ 2005",
  "12-1-1997", "11 / 11 / 1918"
)
dates

[1] "3 - 14 - 1963"  "4/13/ 2005"     "12-1-1997"      "11 / 11 / 1918"

We would prefer them to be in just ONE format!

Solution: `str_replace_all()`

dates %>% 
  str_replace_all(
    pattern = "[- /]+",
    replacement = "/"
  )

[1] "3/14/1963"  "4/13/2005"  "12/1/1997"  "11/11/1918"

Parameters:

x (not written out here, due to the piping) is the text in which the substitution occurs;
pattern is the regex for the type of sub-string we want to replace;
replacement is what we want to replace matches of the pattern with.

Aside: `str_replace()`

dates %>% 
  str_replace(
    pattern = "[- /]+",
    replacement = "/"
  )

[1] "3/14 - 1963"  "4/13/ 2005"   "12/1-1997"    "11/11 / 1918"

NOT what we need, here.
But if you KNOW there is at most one match in each string, use it! (It can stop at the first match instead of scanning the rest of the string.)

Patterned Replacement

Task: write a function called doubleVowels() that replaces every vowel with its double:

Example of use:

doubleVowels("Far and away the best!")

[1] "Faar aand aawaay thee beest!"

Solution

doubleVowels <- function(str) {
  str %>% 
    str_replace_all(
      pattern = "([aeiou])", 
      replacement = "\\1\\1"
    )
}

The replacement argument can contain back-references such as \\1, which stand for whatever the corresponding capture group in the pattern matched!
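Back-references are not limited to repeating a group; they can also reorder groups. The helper flipName() below is hypothetical, written only to illustrate the idea with two capture groups.

```r
library(stringr)

# Hypothetical helper: turn "Last, First" into "First Last".
flipName <- function(str) {
  str %>%
    str_replace(
      pattern = "(\\w+), (\\w+)",   # group 1 = last name, group 2 = first name
      replacement = "\\2 \\1"       # emit the groups in the opposite order
    )
}

flipped <- flipName(c("Philson, Mickey", "Shiner, Marget"))
flipped
```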

Replacement Via Function

Capitalize every vowel:

capVowels <- function(str) {
  str %>% 
    str_replace_all(
      pattern = "[aeiou]", 
      replacement = function(x) str_to_upper(x)
    )
}
capVowels("Far and away the best!")

[1] "FAr And AwAy thE bEst!"

Replacement Via Function (Again)

Put asterisks around every word-repetition:

starRepeats <- function(str) {
  str %>% 
    str_replace_all(
      pattern = "\\b(\\w+) \\1\\b",
      replacement = function(x) {
        str_c("*", x, "*")
      }
    )
}
starRepeats("I have a boo boo on my knee knee.")

[1] "I have a *boo boo* on my *knee knee*."

Detecting Matches

Getting the Strings That Contain a Match

Some sentences:

sentences <- c(
  "My name is Tom, Sir",
  "And I'm Tiny Tulip!",
  "Whereas my name is Lester."
)

Select all and only the strings that contain a word beginning with capital T.
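One possible approach, sketched here on the assumption that str_subset() is the intended tool: it keeps exactly those strings in which the pattern matches somewhere.

```r
library(stringr)

sentences <- c(
  "My name is Tom, Sir",
  "And I'm Tiny Tulip!",
  "Whereas my name is Lester."
)

# \\b asserts a word boundary, so T must be the first letter of a word.
withT <- sentences %>%
  str_subset(pattern = "\\bT\\w*")
withT
```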

Knowing WHEN there is a Match

My name is Tom, Sir
And I'm Tiny Tulip!
Whereas my name is Lester.

If you just need to know whether or not there is a match:
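A minimal sketch, assuming str_detect() is what is wanted here: it returns one logical value per string.

```r
library(stringr)

sentences <- c(
  "My name is Tom, Sir",
  "And I'm Tiny Tulip!",
  "Whereas my name is Lester."
)

# TRUE where some word begins with capital T, FALSE otherwise.
hasT <- sentences %>%
  str_detect(pattern = "\\bT\\w*")
hasT
```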

Knowing WHERE There is a Match

My name is Tom, Sir
And I'm Tiny Tulip!
Whereas my name is Lester.
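To get positions rather than a yes/no answer, str_locate() is one option. This sketch assumes the same "word beginning with capital T" pattern; it reports the start and end of the first match in each string, with NA where there is none.

```r
library(stringr)

sentences <- c(
  "My name is Tom, Sir",
  "And I'm Tiny Tulip!",
  "Whereas my name is Lester."
)

# One row per string, with columns "start" and "end" (first match only).
firstLoc <- sentences %>%
  str_locate(pattern = "\\bT\\w*")
firstLoc
```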

Knowing WHERE (All)

My name is Tom, Sir
And I'm Tiny Tulip!
Whereas my name is Lester.
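For every match rather than just the first, str_locate_all() can be used. The sketch below again assumes the capital-T pattern; the result is a list holding one start/end matrix per string.

```r
library(stringr)

sentences <- c(
  "My name is Tom, Sir",
  "And I'm Tiny Tulip!",
  "Whereas my name is Lester."
)

# A list: element i is a matrix with one row per match in sentence i.
allLocs <- sentences %>%
  str_locate_all(pattern = "\\bT\\w*")

allLocs[[2]]        # two rows: one for "Tiny", one for "Tulip"
nrow(allLocs[[3]])  # zero rows: no match in the third sentence
```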

Extracting Matches

Example

Extract pairs of words beginning with the same letter in sentences2 defined below:

sentences2 <- c(
  "The big bad wolf is walking warily to the cottage.",
  "He huffs and he puffs peevishly.",
  "He wears gnarly gargantuan bell bottoms!"
)

Extract ALL the Matches in Each Sentence

The big bad wolf is walking warily to the cottage.
He huffs and he puffs peevishly.
He wears gnarly gargantuan bell bottoms!
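A sketch of how the extraction might go. The pattern is an assumption (a case-sensitive version of the commented pattern shown later under Regex Modes): capture a word's first letter, then require the next word to start with the same letter.

```r
library(stringr)

sentences2 <- c(
  "The big bad wolf is walking warily to the cottage.",
  "He huffs and he puffs peevishly.",
  "He wears gnarly gargantuan bell bottoms!"
)

# (\\w) captures the first letter; \\1 requires it again after the gap.
pairPattern <- "\\b(\\w)\\w*\\W+\\1\\w*"

# str_extract() gives only the first match in each sentence ...
firstPairs <- str_extract(sentences2, pattern = pairPattern)
# ... while str_extract_all() gives a list with every match.
allPairs <- str_extract_all(sentences2, pattern = pairPattern)

firstPairs
allPairs[[1]]
```

Because this version is case-sensitive, "He huffs" is not counted as a pair; adding (?i) at the front of the pattern is one way to change that.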

More Info With `str_match()`:

The big bad wolf is walking warily to the cottage.
He huffs and he puffs peevishly.
He wears gnarly gargantuan bell bottoms!

First column gives the entire match.
Second column gives the value of the \\1 capture-group.
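A minimal sketch of a call that yields such a two-column matrix, assuming a case-sensitive "same first letter" pattern with a single capture group:

```r
library(stringr)

sentences2 <- c(
  "The big bad wolf is walking warily to the cottage.",
  "He huffs and he puffs peevishly.",
  "He wears gnarly gargantuan bell bottoms!"
)

# One row per sentence (first match only):
# column 1 is the whole match, column 2 is capture group 1.
firstMatch <- sentences2 %>%
  str_match(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
firstMatch
```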

Even More Info

The big bad wolf is walking warily to the cottage.
He huffs and he puffs peevishly.
He wears gnarly gargantuan bell bottoms!
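To get the same match-plus-groups detail for every match, str_match_all() is a natural candidate. This sketch assumes the same case-sensitive pattern; it returns a list of matrices, one per sentence.

```r
library(stringr)

sentences2 <- c(
  "The big bad wolf is walking warily to the cottage.",
  "He huffs and he puffs peevishly.",
  "He wears gnarly gargantuan bell bottoms!"
)

# Element i has one row per match in sentence i:
# column 1 is the whole match, column 2 is capture group 1.
everyMatch <- sentences2 %>%
  str_match_all(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
everyMatch[[1]]
```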

Extraction in Data Frames

Recall our motivating example (from package bcscr):

NamePhone

                 name         phone
1     Philson, Mickey  580-789-5775
2      Shiner, Marget (206)948-8169
3   Sackrider, Dionne (432)297-3683
4    Kukowski, Isobel (240)619-8432
5     Isenhour, Garth    6417823425
6       Kapinos, Enid    6018723027
7    Blaker, Theodore (510)812-9092
8   Crossett, Rosaura    6063292954
9     Northern, Willy  551-427-1399
10    Goettl, Latonia (303)242-6982
11    Campagna, Ryann  727-692-1835
12         Wash, Mira  509-216-3598
13  Flansburg, Louann    3049163908
14  Winborne, Angella (678)249-9107
15    Arledge, Marcia (430)625-4239
16    Cookson, Eladia  507-588-4874
17        Tisher, Dee  470-439-4114
18    Difiore, Tyrell    4055294829
19     Colas, Tristan    7857923661
20      Sprenger, Ava (217)343-9603
21    Getman, Jesenia (646)812-6606
22      Starr, Ashley (281)514-6984
23     Raney, Irmgard  573-586-5935
24     Bryson, Dionna (325)627-2149
25        Welk, Bruno    2894782665
26        Dias, Petra    2694361985
27    Alejandro, Nana  254-563-7229
28      Sanson, Jason (469)453-3600
29    Ellerbe, Gracia  320-749-5706
30     Parris, Julius  630-537-5563
31 Tomasello, Rachele (240)696-2942
32  Tackitt, Mireille    3312028129
33 Taliaferro, Kaycee    7622728177
34  Imperato, Natalya    6572415716
35   Letcher, Basilia (401)437-2309
36    Gallaher, Deena  269-521-6040
37      Pierri, Viola    6572108846
38   Benefiel, Chante    4257611776
39       Phan, Kellye  479-325-3593
40      Cosenza, Saul    2165746335
41    Neihoff, Velvet  337-314-5395
42   Arboleda, Lynsey (306)409-9494
43   Metcalfe, Mervin (319)219-2300
44    Hammes, Stefani  630-629-4630
45   Nordahl, Yahaira (610)390-8353
46   Nader, Marceline  660-299-3416
47   Lasorsa, Vicente    7135491648
48   Bessette, Esther    4257614047
49 Hinchman, Marisela    8479223654
50  Lippincott, Lucia  631-512-5400

`tidyr::extract()` Does the Job!

desired <-
  NamePhone %>% 
  tidyr::extract(
    col = name,
    into = c("last", "first"),
    regex = "(\\w+), (\\w+)"
  ) %>% 
  tidyr::extract(
    col = phone,
    into = c("area", "office", "line"),
    regex = "(\\d{3})\\D*(\\d{3})\\D*(\\d{4})"
  )

desired

         last     first area office line
1     Philson    Mickey  580    789 5775
2      Shiner    Marget  206    948 8169
3   Sackrider    Dionne  432    297 3683
4    Kukowski    Isobel  240    619 8432
5    Isenhour     Garth  641    782 3425
6     Kapinos      Enid  601    872 3027
7      Blaker  Theodore  510    812 9092
8    Crossett   Rosaura  606    329 2954
9    Northern     Willy  551    427 1399
10     Goettl   Latonia  303    242 6982
11   Campagna     Ryann  727    692 1835
12       Wash      Mira  509    216 3598
13  Flansburg    Louann  304    916 3908
14   Winborne   Angella  678    249 9107
15    Arledge    Marcia  430    625 4239
16    Cookson    Eladia  507    588 4874
17     Tisher       Dee  470    439 4114
18    Difiore    Tyrell  405    529 4829
19      Colas   Tristan  785    792 3661
20   Sprenger       Ava  217    343 9603
21     Getman   Jesenia  646    812 6606
22      Starr    Ashley  281    514 6984
23      Raney   Irmgard  573    586 5935
24     Bryson    Dionna  325    627 2149
25       Welk     Bruno  289    478 2665
26       Dias     Petra  269    436 1985
27  Alejandro      Nana  254    563 7229
28     Sanson     Jason  469    453 3600
29    Ellerbe    Gracia  320    749 5706
30     Parris    Julius  630    537 5563
31  Tomasello   Rachele  240    696 2942
32    Tackitt  Mireille  331    202 8129
33 Taliaferro    Kaycee  762    272 8177
34   Imperato   Natalya  657    241 5716
35    Letcher   Basilia  401    437 2309
36   Gallaher     Deena  269    521 6040
37     Pierri     Viola  657    210 8846
38   Benefiel    Chante  425    761 1776
39       Phan    Kellye  479    325 3593
40    Cosenza      Saul  216    574 6335
41    Neihoff    Velvet  337    314 5395
42   Arboleda    Lynsey  306    409 9494
43   Metcalfe    Mervin  319    219 2300
44     Hammes   Stefani  630    629 4630
45    Nordahl   Yahaira  610    390 8353
46      Nader Marceline  660    299 3416
47    Lasorsa   Vicente  713    549 1648
48   Bessette    Esther  425    761 4047
49   Hinchman  Marisela  847    922 3654
50 Lippincott     Lucia  631    512 5400

What if You Want to Keep the Original Columns?

desired2 <-
  NamePhone %>% 
  tidyr::extract(
    col = name,
    into = c("last", "first"),
    regex = "(\\w+), (\\w+)",
    remove = FALSE
  ) %>% 
  tidyr::extract(
    col = phone,
    into = c("area", "office", "line"),
    regex = "(\\d{3})\\D*(\\d{3})\\D*(\\d{4})",
    remove = FALSE
  )

desired2

                 name       last     first         phone area office line
1     Philson, Mickey    Philson    Mickey  580-789-5775  580    789 5775
2      Shiner, Marget     Shiner    Marget (206)948-8169  206    948 8169
3   Sackrider, Dionne  Sackrider    Dionne (432)297-3683  432    297 3683
4    Kukowski, Isobel   Kukowski    Isobel (240)619-8432  240    619 8432
5     Isenhour, Garth   Isenhour     Garth    6417823425  641    782 3425
6       Kapinos, Enid    Kapinos      Enid    6018723027  601    872 3027
7    Blaker, Theodore     Blaker  Theodore (510)812-9092  510    812 9092
8   Crossett, Rosaura   Crossett   Rosaura    6063292954  606    329 2954
9     Northern, Willy   Northern     Willy  551-427-1399  551    427 1399
10    Goettl, Latonia     Goettl   Latonia (303)242-6982  303    242 6982
11    Campagna, Ryann   Campagna     Ryann  727-692-1835  727    692 1835
12         Wash, Mira       Wash      Mira  509-216-3598  509    216 3598
13  Flansburg, Louann  Flansburg    Louann    3049163908  304    916 3908
14  Winborne, Angella   Winborne   Angella (678)249-9107  678    249 9107
15    Arledge, Marcia    Arledge    Marcia (430)625-4239  430    625 4239
16    Cookson, Eladia    Cookson    Eladia  507-588-4874  507    588 4874
17        Tisher, Dee     Tisher       Dee  470-439-4114  470    439 4114
18    Difiore, Tyrell    Difiore    Tyrell    4055294829  405    529 4829
19     Colas, Tristan      Colas   Tristan    7857923661  785    792 3661
20      Sprenger, Ava   Sprenger       Ava (217)343-9603  217    343 9603
21    Getman, Jesenia     Getman   Jesenia (646)812-6606  646    812 6606
22      Starr, Ashley      Starr    Ashley (281)514-6984  281    514 6984
23     Raney, Irmgard      Raney   Irmgard  573-586-5935  573    586 5935
24     Bryson, Dionna     Bryson    Dionna (325)627-2149  325    627 2149
25        Welk, Bruno       Welk     Bruno    2894782665  289    478 2665
26        Dias, Petra       Dias     Petra    2694361985  269    436 1985
27    Alejandro, Nana  Alejandro      Nana  254-563-7229  254    563 7229
28      Sanson, Jason     Sanson     Jason (469)453-3600  469    453 3600
29    Ellerbe, Gracia    Ellerbe    Gracia  320-749-5706  320    749 5706
30     Parris, Julius     Parris    Julius  630-537-5563  630    537 5563
31 Tomasello, Rachele  Tomasello   Rachele (240)696-2942  240    696 2942
32  Tackitt, Mireille    Tackitt  Mireille    3312028129  331    202 8129
33 Taliaferro, Kaycee Taliaferro    Kaycee    7622728177  762    272 8177
34  Imperato, Natalya   Imperato   Natalya    6572415716  657    241 5716
35   Letcher, Basilia    Letcher   Basilia (401)437-2309  401    437 2309
36    Gallaher, Deena   Gallaher     Deena  269-521-6040  269    521 6040
37      Pierri, Viola     Pierri     Viola    6572108846  657    210 8846
38   Benefiel, Chante   Benefiel    Chante    4257611776  425    761 1776
39       Phan, Kellye       Phan    Kellye  479-325-3593  479    325 3593
40      Cosenza, Saul    Cosenza      Saul    2165746335  216    574 6335
41    Neihoff, Velvet    Neihoff    Velvet  337-314-5395  337    314 5395
42   Arboleda, Lynsey   Arboleda    Lynsey (306)409-9494  306    409 9494
43   Metcalfe, Mervin   Metcalfe    Mervin (319)219-2300  319    219 2300
44    Hammes, Stefani     Hammes   Stefani  630-629-4630  630    629 4630
45   Nordahl, Yahaira    Nordahl   Yahaira (610)390-8353  610    390 8353
46   Nader, Marceline      Nader Marceline  660-299-3416  660    299 3416
47   Lasorsa, Vicente    Lasorsa   Vicente    7135491648  713    549 1648
48   Bessette, Esther   Bessette    Esther    4257614047  425    761 4047
49 Hinchman, Marisela   Hinchman  Marisela    8479223654  847    922 3654
50  Lippincott, Lucia Lippincott     Lucia  631-512-5400  631    512 5400

Counting Matches

Example

Count the number of words in each string that begin with a lowercase or uppercase p.

more_strings <- c(
  "Mary Poppins is practically perfect in every way!",
  "The best-laid plans of mice and men gang oft astray.",
  "Peter Piper picked a peck of pickled peppers."
)
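One way to do the count, sketched on the assumption that str_count() is the intended function: it returns the number of matches in each string.

```r
library(stringr)

more_strings <- c(
  "Mary Poppins is practically perfect in every way!",
  "The best-laid plans of mice and men gang oft astray.",
  "Peter Piper picked a peck of pickled peppers."
)

# \\b[pP] means a p or P at the start of a word; a p inside a word
# (as in "Poppins") is not counted again.
pCounts <- more_strings %>%
  str_count(pattern = "\\b[pP]\\w*")
pCounts
```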

A More Challenging Task

How might we find the words in a string that contain three or more of the same letter?
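One possible answer, offered as a sketch rather than the intended solution: split the string into words, then keep the words in which a captured character turns up twice more. The helper name tripleLetterWords() is hypothetical, and ignoring case by lower-casing first is an assumption about the task.

```r
library(stringr)

# Hypothetical helper: words of `str` in which some character occurs
# three or more times (case ignored by lower-casing first).
tripleLetterWords <- function(str) {
  words <- str %>%
    str_to_lower() %>%
    str_extract_all(pattern = "\\w+") %>%
    unlist()
  # (\\w) captures a character; each \\1 demands another copy later on.
  words[str_detect(words, pattern = "(\\w)\\w*\\1\\w*\\1")]
}

peterWords <- tripleLetterWords("Peter Piper picked a peck of pickled peppers.")
maryWords <- tripleLetterWords("Mary Poppins is practically perfect in every way!")
peterWords
maryWords
```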

Regex Modes

Modes

In some other languages you see a regex like this:

/regex/gm

Modes are at the end. Popular modes:

g: “global”, looking for all possible matches in the string;
i: “case-insensitive” mode, so that letter-characters in the regex match both their upper and lower-case versions;
m: “multiline” mode, so that the anchors ^ and $ are attached to newlines within the string rather than to the absolute beginning and end of the string;
x: “white-space” mode, where white-spaces in the regex are ignored unless they are escaped (useful for lining out the regex and inserting comments to explain its operation).

Note on Global Mode

Since stringr provides _all variants of its main regex functions (str_replace_all(), str_extract_all(), and so on), we don’t usually have to worry about setting global mode in R.

Setting Modes in R

Put (?...) in the pattern, with the mode letters in place of the dots, at the point where you want the modes to take effect. (Usually at the beginning of the string.)

Example:

"(?im)t[aeiou]{1,3}$"

At the end of lines in the string we are looking for t (or T) followed by one to three vowels. Uppercase, or lowercase – doesn’t matter.
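A small check of this pattern on an assumed three-line sample string. The second call sets the same two modes through the ignore_case and multiline arguments of stringr's regex() instead of inline flags.

```r
library(stringr)

# Assumed sample text: three lines in one string.
txt <- "Tea\nmore\nTOO"

# Inline flags: i = ignore case, m = multiline anchors.
inline <- str_extract_all(txt, pattern = "(?im)t[aeiou]{1,3}$")[[1]]

# The same modes, set through regex() arguments.
viaArgs <- str_extract_all(
  txt,
  pattern = regex("t[aeiou]{1,3}$", ignore_case = TRUE, multiline = TRUE)
)[[1]]

inline
identical(inline, viaArgs)
```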

Comments With “Ignore Whitespace” Mode

myPattern <-
  "(?xi)       # ignore whitespace (x) and ignore case (i)
  \\b          # assert a word-boundary
  (\\w)        # capture the first letter of the first word
  \\w*         # rest of the first word
  \\W+         # one or more non-word characters
  \\1          # repeat the letter captured previously
  \\w*         # rest of the second word
  "

Application

The big bad wolf is walking warily to the cottage.
He huffs and he puffs peevishly.
He wears gnarly gargantuan bell bottoms!
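A sketch of how the commented pattern might be applied, assuming str_extract_all() is the function of interest. The pattern and the sentences are repeated so that the example runs on its own.

```r
library(stringr)

sentences2 <- c(
  "The big bad wolf is walking warily to the cottage.",
  "He huffs and he puffs peevishly.",
  "He wears gnarly gargantuan bell bottoms!"
)

myPattern <-
  "(?xi)       # ignore whitespace (x) and ignore case (i)
  \\b          # assert a word-boundary
  (\\w)        # capture the first letter of the first word
  \\w*         # rest of the first word
  \\W+         # one or more non-word characters
  \\1          # repeat the letter captured previously
  \\w*         # rest of the second word
  "

# A list with every same-first-letter pair found in each sentence.
pairs <- sentences2 %>%
  str_extract_all(pattern = myPattern)
pairs[[1]]
```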

Practice

Let’s try some of these ideas:

https://homerhanumat.github.io/r-notes/110-regex.html#practice-exercises
