Regular Expressions in R

(Section 12.4)

Packages

Make sure these are attached:

library(bcscr)
library(tidyverse)

String to Regex

Example: Splitting on a Space

"hello there Mary Poppins" %>% 
  str_split(pattern = " ") %>% 
  unlist()
[1] "hello"   "there"   "Mary"    "Poppins"

The pattern parameter in str_*() functions takes a string that is converted to a regular expression by R.

Example: Splitting on Whitespace

myString <- "hello\t\tthere\n\nMary  \t Poppins"
cat(myString)
hello       there

Mary     Poppins

Try to split on the whitespace:

myString %>% 
  str_split(pattern = "\s+") %>% 
  unlist()
Error: '\s' is an unrecognized escape in character string starting ""\s"

The Problem

Before R can convert pattern to a regex, it has to understand pattern for what it is: a string!

In an R string, a backslash starts an escape sequence such as \t or \n. But \s is not an escape sequence that R recognizes!

Solution

So to get a literal \s, we must escape the \:

myString %>% 
  str_split(pattern = "\\s+") %>% 
  unlist()
[1] "hello"   "there"   "Mary"    "Poppins"

Moral: Always escape again when the regex you intend uses a backslash for escaping!

String-to-Regex Examples

Examples of entry of regular expressions as strings, in R:

Regular Expression    Entered as String
\s+                   "\\s+"
find\.dot             "find\\.dot"
^\w*\d{1,3}$          "^\\w*\\d{1,3}$"
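
To check which regex a string really holds, it can help to print it. The following is a minimal sketch of that idea. It assumes R 4.0.0 or later for the raw-string syntax r"(...)", and the name words is only for illustration.

```r
library(stringr)

# writeLines() prints the characters a string actually contains,
# that is, the regex that the regex engine will receive.
writeLines("\\s+")
writeLines("find\\.dot")

# Raw strings (assumed available: R 4.0.0 or later) need no doubled backslash.
sameString <- identical(r"(\s+)", "\\s+")

# The raw string works as a pattern just like the escaped version.
words <- unlist(str_split("hello\t\tthere\n\nMary  \t Poppins", pattern = r"(\s+)"))
words
```

If raw strings are not available in your version of R, the doubled-backslash form always works.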

Substitution

Example: Messy Dates

Suppose that we have a vector of dates:

dates <- c(
  "3 - 14 - 1963", "4/13/ 2005",
  "12-1-1997", "11 / 11 / 1918"
)

We would prefer them to be in just ONE format!

Solution: str_replace_all()

dates %>% 
  str_replace_all(
    pattern = "[- /]+",
    replacement = "/"
  )
[1] "3/14/1963"  "4/13/2005"  "12/1/1997"  "11/11/1918"

Parameters:

  • string (not written out here, due to the piping) is the text in which the substitution occurs;
  • pattern is the regex for the type of sub-string we want to replace;
  • replacement is what we want to replace matches of the pattern with.
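
For comparison, here is a sketch of the same call written without the pipe, so that all three arguments appear by name. stringr names the first argument string; the variable cleaned is only for illustration.

```r
library(stringr)

dates <- c(
  "3 - 14 - 1963", "4/13/ 2005",
  "12-1-1997", "11 / 11 / 1918"
)

# Every run of hyphens, spaces and slashes becomes a single slash.
cleaned <- str_replace_all(
  string = dates,
  pattern = "[- /]+",
  replacement = "/"
)
cleaned
```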

Aside: str_replace()

dates %>% 
  str_replace(
    pattern = "[- /]+",
    replacement = "/"
  )
[1] "3/14 - 1963"  "4/13/ 2005"   "12/1-1997"    "11/11 / 1918"
  • NOT what we need, here.
  • But if you KNOW there is only one match, use it! (It can be faster, because it stops searching after the first match.)

Patterned Replacement

Task: write a function called doubleVowels() that replaces every (lowercase) vowel with its double.

Example of use:

doubleVowels("Far and away the best!")
[1] "Faar aand aawaay thee beest!"

Solution

doubleVowels <- function(str) {
  str %>% 
    str_replace_all(
      pattern = "([aeiou])", 
      replacement = "\\1\\1"
    )
}

The replacement argument can include back-references such as \\1, which refer to the capture groups in the pattern!
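
Back-references can also reorder what was captured. The sketch below uses a made-up helper, swapNames(), to turn "Last, First" into "First Last"; the helper and its inputs are illustrative only.

```r
library(stringr)

# swapNames() is a hypothetical helper, not part of any package.
swapNames <- function(str) {
  str %>%
    str_replace(
      pattern = "(\\w+), (\\w+)",   # group 1: last name, group 2: first name
      replacement = "\\2 \\1"       # first name, a space, then last name
    )
}

swapped <- swapNames(c("Poppins, Mary", "Piper, Peter"))
swapped
```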

Replacement Via Function

Capitalize every vowel:

capVowels <- function(str) {
  str %>% 
    str_replace_all(
      pattern = "[aeiou]", 
      replacement = function(x) str_to_upper(x)
    )
}
capVowels("Far and away the best!")
[1] "FAr And AwAy thE bEst!"

Replacement Via Function (Again)

Put asterisks around every word-repetition:

starRepeats <- function(str) {
  str %>% 
    str_replace_all(
      pattern = "\\b(\\w+) \\1\\b",
      replacement = function(x) {
        str_c("*", x, "*")
      }
    )
}
starRepeats("I have a boo boo on my knee knee.")
[1] "I have a *boo boo* on my *knee knee*."
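
A replacement function receives the matched text and returns the text to put in its place, so it can do things that back-references cannot, such as arithmetic. This is a minimal sketch, assuming a version of stringr that accepts a function for replacement (as the examples above do). The helper doubleNumbers() is made up, and it is written to be vectorized so that it does not matter how many matches it is handed at once.

```r
library(stringr)

# doubleNumbers() is a hypothetical helper: it doubles every whole number in a string.
doubleNumbers <- function(str) {
  str %>%
    str_replace_all(
      pattern = "\\d+",
      # x holds matched digit-strings; return character strings of the same length
      replacement = function(x) as.character(2 * as.numeric(x))
    )
}

doubled <- doubleNumbers("I have 2 cats and 10 dogs")
doubled
```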

Detecting Matches

Getting the Strings That Contain a Match

Some sentences:

sentences <- c(
  "My name is Tom, Sir",
  "And I'm Tulip!",
  "Whereas my name is Lester."
)

Select all and only the strings that contain a word beginning with capital T.

sentences %>% 
  str_subset(pattern = "\\bT\\w*\\b")
[1] "My name is Tom, Sir" "And I'm Tulip!"     

Knowing WHETHER There is a Match

If you just need to know whether or not there is a match:

sentences %>% 
  str_detect(pattern = "\\bT\\w*\\b")
[1]  TRUE  TRUE FALSE
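
The logical vector from str_detect() ties the detection functions together. The sketch below shows how it relates to str_subset() and to str_which(); the names hits and tPattern are only for illustration.

```r
library(stringr)

sentences <- c(
  "My name is Tom, Sir",
  "And I'm Tulip!",
  "Whereas my name is Lester."
)
tPattern <- "\\bT\\w*\\b"

hits <- str_detect(sentences, pattern = tPattern)

# Subsetting by the logical vector gives what str_subset() gives.
sameStrings <- identical(sentences[hits], str_subset(sentences, pattern = tPattern))

# The positions of the TRUE values are what str_which() gives.
sameIndices <- identical(which(hits), str_which(sentences, pattern = tPattern))

# Summing the logical vector counts the strings that contain a match.
numberWithMatch <- sum(hits)
```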

Knowing WHERE There is a Match

sentences %>% 
  str_locate(pattern = "\\bT\\w*\\b")
     start end
[1,]    12  14
[2,]     9  13
[3,]    NA  NA

(This only addresses the first match in a string.)
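
To locate every match rather than only the first, there is str_locate_all(), which returns a list with one matrix per input string. A small sketch with a made-up sentence:

```r
library(stringr)

# A made-up sentence containing three words that begin with capital T.
locs <- str_locate_all(
  "Tom and Tulip talk to Ted",
  pattern = "\\bT\\w*\\b"
)

# One list element per input string; each row is one match.
locs[[1]]

# The start and end columns can be passed to str_sub() to recover the matches.
found <- str_sub(
  "Tom and Tulip talk to Ted",
  start = locs[[1]][, "start"],
  end = locs[[1]][, "end"]
)
found
```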

Extracting Matches

Example

Extract pairs of words beginning with the same letter in sentences2 defined below:

sentences2 <- c(
  "The big bad wolf is walking warily to the cottage.",
  "He huffs and he puffs peevishly.",
  "He wears gnarly gargantuan bell bottoms!"
)
sentences2 %>% 
  str_extract(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
[1] "big bad"           "puffs peevishly"   "gnarly gargantuan"

Extract ALL the Matches in Each Sentence

sentences2 %>% 
  str_extract_all(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
[[1]]
[1] "big bad"        "walking warily" "to the"        

[[2]]
[1] "puffs peevishly"

[[3]]
[1] "gnarly gargantuan" "bell bottoms"     

You get a list.
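
Two common things to do with such a list are to count the matches per string and to flatten everything into one vector. A brief sketch, with pairs, perSentence and allPairs as illustrative names:

```r
library(stringr)

sentences2 <- c(
  "The big bad wolf is walking warily to the cottage.",
  "He huffs and he puffs peevishly.",
  "He wears gnarly gargantuan bell bottoms!"
)

pairs <- sentences2 %>%
  str_extract_all(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")

# Number of matches found in each sentence.
perSentence <- lengths(pairs)

# All matches in a single character vector (sentence boundaries are lost).
allPairs <- unlist(pairs)
```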

More Info With str_match():

sentences2 %>% 
  str_match(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
     [,1]                [,2]
[1,] "big bad"           "b" 
[2,] "puffs peevishly"   "p" 
[3,] "gnarly gargantuan" "g" 
  • First column gives the entire match.
  • Second column gives the value of the first capture group (the one that \\1 refers to).

Even More Info

sentences2 %>% 
  str_match_all(pattern = "\\b(\\w)\\w*\\W+\\1\\w*")
[[1]]
     [,1]             [,2]
[1,] "big bad"        "b" 
[2,] "walking warily" "w" 
[3,] "to the"         "t" 

[[2]]
     [,1]              [,2]
[1,] "puffs peevishly" "p" 

[[3]]
     [,1]                [,2]
[1,] "gnarly gargantuan" "g" 
[2,] "bell bottoms"      "b" 

Extraction in Data Frames

Recall our motivating example:

View(NamePhone)

tidyr::extract() Does the Job!

desired <-
  NamePhone %>% 
  tidyr::extract(
    col = name,
    into = c("last", "first"),
    regex = "(\\w+), (\\w+)"
  ) %>% 
  tidyr::extract(
    col = phone,
    into = c("area", "office", "line"),
    regex = "(\\d{3})\\D*(\\d{3})\\D*(\\d{4})"
  )
View(desired)

What if You Want to Keep the Original Columns?

desired2 <-
  NamePhone %>% 
  tidyr::extract(
    col = name,
    into = c("last", "first"),
    regex = "(\\w+), (\\w+)",
    remove = FALSE
  ) %>% 
  tidyr::extract(
    col = phone,
    into = c("area", "office", "line"),
    regex = "(\\d{3})\\D*(\\d{3})\\D*(\\d{4})",
    remove = FALSE
  )
View(desired2)
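
If NamePhone is not at hand, the same steps can be tried on a small stand-in. In the sketch below the data frame toy and its two rows are invented for illustration; only the column names name and phone are taken from the code above.

```r
# toy is a hypothetical stand-in for NamePhone, with made-up entries.
toy <- data.frame(
  name = c("Poppins, Mary", "Piper, Peter"),
  phone = c("(502) 555-1234", "859.555.9876"),
  stringsAsFactors = FALSE
)

# Each capture group in regex fills one of the columns named in into.
result <- tidyr::extract(
  toy,
  col = name,
  into = c("last", "first"),
  regex = "(\\w+), (\\w+)"
)
result <- tidyr::extract(
  result,
  col = phone,
  into = c("area", "office", "line"),
  regex = "(\\d{3})\\D*(\\d{3})\\D*(\\d{4})"
)
result
```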

Counting Matches

Example

Count the number of words in a string that begin with a lowercase or uppercase p.

strings <- c(
  "Mary Poppins is practically perfect in every way!",
  "The best-laid plans of mice and men gang oft astray.",
  "Peter Piper picked a peck of pickled peppers."
)
strings %>%
  str_count(pattern = "\\b[Pp]\\w*\\b")
[1] 3 1 6

A More Challenging Task

How might we find the words in a string that contain three or more of the same letter?

"In Patagonia, the peerless Peter Piper picked a peck of pickled peppers." %>% 
  str_split("\\W+") %>% 
  unlist() %>% 
  str_subset(pattern = "([[:alpha:]]).*\\1.*\\1")
[1] "Patagonia" "peerless"  "peppers"  

Regex Modes

Modes

In some other languages you see a regex like this:

/regex/gm

Modes are at the end. Popular modes:

  • g: “global”, looking for all possible matches in the string;
  • i: “case-insensitive” mode, so that letter-characters in the regex match both their upper and lower-case versions;
  • m: “multiline” mode, so that the anchors ^ and $ match at the beginning and end of each line within the string rather than only at the absolute beginning and end of the string;
  • x: “ignore-whitespace” mode, where white-space in the regex is ignored unless it is escaped (useful for laying out the regex over several lines and inserting comments to explain its operation).

Note on Global Mode

Since stringr has _all versions of the main regex functions, we don’t usually have to worry about setting global mode in R.

Setting Modes in R

Put (?...), with the mode letters in place of the dots, at the point in the pattern where you want the modes to take effect. (Usually this is the beginning of the pattern.)

Example:

"(?im)t[aeiou]{1,3}$"

At the end of lines in the string we are looking for t (or T) followed by one to three vowels. Uppercase, or lowercase – doesn’t matter.
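
Here is a small sketch of that pattern in action on a made-up three-line string. It also shows what appears to be the equivalent spelling with stringr's regex() helper, whose ignore_case and multiline arguments set the same two modes; the names poem, inline and withArgs are only for illustration.

```r
library(stringr)

# A made-up string with three lines.
poem <- "with TEA\nand a potato\nin a pot"

# Modes set inside the pattern itself.
inline <- str_extract_all(poem, pattern = "(?im)t[aeiou]{1,3}$")

# The same modes set through arguments of regex().
withArgs <- str_extract_all(
  poem,
  pattern = regex("t[aeiou]{1,3}$", ignore_case = TRUE, multiline = TRUE)
)

inline[[1]]
```

The last line, "in a pot", yields no match because the final t is not followed by a vowel.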

Comments With “Ignore Whitespace” Mode

myPattern <-
  "(?xi)       # ignore whitespace (x) and ignore case (i)
  \\b          # assert a word-boundary
  (\\w)        # capture the first letter of the first word
  \\w*         # rest of the first word
  \\W+         # one or more non-word characters
  \\1          # repeat the letter captured previously
  \\w*         # rest of the second word
  "

Application

sentences2 %>% 
  str_match_all(pattern = myPattern)
[[1]]
     [,1]             [,2]
[1,] "big bad"        "b" 
[2,] "walking warily" "w" 
[3,] "to the"         "t" 

[[2]]
     [,1]              [,2]
[1,] "He huffs"        "H" 
[2,] "puffs peevishly" "p" 

[[3]]
     [,1]                [,2]
[1,] "gnarly gargantuan" "g" 
[2,] "bell bottoms"      "b" 

(Because of the i mode, "He huffs" now counts as a match: H and h are treated as the same letter.)

Practice

The stringr package comes with fruit, a character vector giving the names of 80 fruits.

  1. Determine how many fruit-names consist of exactly two words.
  2. Find the two-word fruit-names.
  3. Find the indices of the two-word fruit names.
  4. Find the one-word fruit-names that end in “berry”.
  5. Find the fruit-names that contain more than three vowels.

More Practice

In the word “banana” the string “an” appears twice in succession, as does the string “na”.

Find the fruit-names containing at least one string of length two or more that appears twice in succession.