(Section 12.4)
Make sure these are attached:
[1] "hello" "there" "Mary" "Poppins"
The pattern parameter in str_*() functions takes a string that is converted to a regular expression by R.
Try to split on the whitespace:
Before R can convert pattern to a regex, it has to understand pattern for what it is: a string!
But \s is not a control character that R recognizes in a string (unlike \n or \t), so R rejects it!
So to get a literal \s, we must escape the \:
Moral: Always escape again when the regex you intend uses a backslash for escaping!
| Regular Expression | Entered as String |
|---|---|
| \s+ | "\\s+" |
| find\.dot | "find\\.dot" |
| ^\w*\d{1,3}$ | "^\\w*\\d{1,3}$" |
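A minimal sketch of the double-escaping rule in the table. The input string is an assumption, chosen so that the split gives the four words shown earlier in this section.

```r
library(stringr)

# Hypothetical input: words separated by runs of spaces and a tab.
greeting <- "hello   there Mary\tPoppins"

# In R source, "\\s+" is the three-character string \s+ ;
# the regex engine then reads \s+ as "one or more whitespace characters".
pattern_length <- nchar("\\s+")

# str_split() returns a list; take the first (and only) element.
pieces <- str_split(greeting, pattern = "\\s+")[[1]]
pieces
```

Calling writeLines("\\s+") is a quick way to see the regex that R actually hands to the regex engine.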
Suppose that we have a vector of dates:
We would prefer them to be in just ONE format!
str_replace_all()
[1] "3/14/1963" "4/13/2005" "12/1/1997" "11/11/1918"
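One way the output above could be produced. The dates vector and the pattern are assumptions (the original chunk is not shown here), and the call is written without a pipe so every argument is visible.

```r
library(stringr)

# Hypothetical input: the same dates written with mixed separators.
dates <- c("3 - 14 - 1963", "4/13/ 2005", "12-1-1997", "11 / 11 / 1918")

# A separator is a hyphen or slash with optional whitespace on either side.
# Every match is replaced with a plain slash.
cleaned <- str_replace_all(
  dates,
  pattern = "\\s*[-/]\\s*",
  replacement = "/"
)
cleaned
```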
Parameters:
string (not written out here, due to the piping) is the text in which the substitution occurs;
pattern is the regex for the type of sub-string we want to replace;
replacement is what we want to replace matches of the pattern with.
str_replace()
[1] "3/14 - 1963" "4/13/ 2005" "12/1-1997" "11/11 / 1918"
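A sketch of the difference: str_replace() changes only the first match in each string. The dates vector and pattern are the same assumptions as before, not the original chunk.

```r
library(stringr)

# Hypothetical input, assumed for illustration.
dates <- c("3 - 14 - 1963", "4/13/ 2005", "12-1-1997", "11 / 11 / 1918")

# Only the first separator in each string is replaced;
# later separators are left as they were.
first_only <- str_replace(dates, pattern = "\\s*[-/]\\s*", replacement = "/")
first_only
```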
Task: write a function called doubleVowels() that replaces every vowel with its double:
Example of use:
The replacement argument can include regex features that refer to elements of the pattern!
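One possible doubleVowels(), sketched with a back-reference in the replacement. The test sentence is made up for illustration.

```r
library(stringr)

# The parentheses capture the vowel that matched;
# "\\1\\1" in the replacement writes that captured vowel twice.
doubleVowels <- function(str) {
  str_replace_all(str, pattern = "([aeiouAEIOU])", replacement = "\\1\\1")
}

doubled <- doubleVowels("Madam, I'm Adam!")
doubled
```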
Capitalize every vowel:
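A sketch using base R rather than stringr: gsub() with perl = TRUE understands \\U in the replacement, which upper-cases what follows. This may not be the approach the original chunk used; the input string is made up.

```r
# Base R only; no packages needed.
phrase <- "regular expression"

# ([aeiou]) captures a lowercase vowel; \\U\\1 writes it back in upper case.
# The \\U case conversion needs perl = TRUE.
capitalized <- gsub("([aeiou])", "\\U\\1", phrase, perl = TRUE)
capitalized
```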
Put asterisks around every word-repetition:
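A sketch for the word-repetition case. The outer group captures the whole repetition so that the replacement can refer to it as \\1; the inner group (\\2) is the repeated word. The sentence is a made-up example.

```r
library(stringr)

typo <- "this is is a test"

# \\b(\\w+) \\2\\b : a whole word, a space, then the same word again.
# The outer parentheses make the full repetition available as \\1.
starred <- str_replace_all(
  typo,
  pattern = "\\b((\\w+) \\2)\\b",
  replacement = "*\\1*"
)
starred
```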
Some sentences:
Select all and only the strings that contain a word beginning with capital T.
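A sketch with str_subset(), which keeps the strings that contain a match. The sentences here are hypothetical stand-ins for the ones on the slide.

```r
library(stringr)

# Hypothetical sentences, assumed for illustration.
some_sentences <- c(
  "The big bad wolf",
  "She sells sea shells",
  "Meet me in Toledo",
  "tiny tim"
)

# \\bT : a capital T at the start of a word.
with_T <- str_subset(some_sentences, pattern = "\\bT")
with_T
```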
If you just need to know whether or not there is a match:
(This only addresses the first match in a string.)
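A sketch of the yes/no question with str_detect(). For contrast it also shows str_extract(), one stringr function that looks only at the first match in each string; whether that is the function the note above refers to is an assumption. The sentences are made up.

```r
library(stringr)

# Hypothetical sentences, assumed for illustration.
some_sentences <- c("The big bad wolf", "Trot to Town", "no capitals here")

# TRUE/FALSE for each string: is there a word starting with capital T?
has_T <- str_detect(some_sentences, pattern = "\\bT\\w*")

# First match only; "Town" is not reported, and a string
# with no match gives NA.
first_T <- str_extract(some_sentences, pattern = "\\bT\\w*")
```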
Extract pairs of words beginning with the same letter in sentences2 defined below:
[[1]]
[1] "big bad" "walking warily" "to the"
[[2]]
[1] "puffs peevishly"
[[3]]
[1] "gnarly gargantuan" "bell bottoms"
You get a list.
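A sketch of how such a list could be produced with str_extract_all(). The definition of sentences2 is not shown in this section, so the sentences below are hypothetical, written to give the pairs listed above.

```r
library(stringr)

# Hypothetical stand-in for sentences2.
sentences2 <- c(
  "The big bad wolf is walking warily to the cottage.",
  "He puffs peevishly at the door.",
  "She wears gnarly gargantuan bell bottoms."
)

# (\\w) captures the first letter of a word; after one whitespace
# character, \\1 requires the next word to start with that same letter.
same_letter <- "\\b(\\w)\\w*\\s\\1\\w*\\b"

pairs <- str_extract_all(sentences2, pattern = same_letter)
pairs
```

The result has one list element per input string, because each string can contain a different number of matches.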
str_match():
     [,1]                [,2]
[1,] "big bad"           "b"
[2,] "puffs peevishly"   "p"
[3,] "gnarly gargantuan" "g"
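A sketch of a str_match() call that gives a matrix like this one: the first match per string in column 1, followed by one column per capture group. As before, the sentences are hypothetical stand-ins for sentences2.

```r
library(stringr)

# Hypothetical stand-in for sentences2.
sentences2 <- c(
  "The big bad wolf is walking warily to the cottage.",
  "He puffs peevishly at the door.",
  "She wears gnarly gargantuan bell bottoms."
)

same_letter <- "\\b(\\w)\\w*\\s\\1\\w*\\b"

# One row per input string; only the first match in each string is used.
first_pairs <- str_match(sentences2, pattern = same_letter)
first_pairs
```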
The second column holds the \\1 capture-group.
Recall our motivating example:
tidyr::extract() Does the Job!
Count the number of words in a string that begin with a lower or uppercase p.
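One possible approach, sketched with str_count(). The tongue-twister is just a convenient test string.

```r
library(stringr)

twister <- "Peter Piper picked a peck of pickled peppers"

# \\b[pP]\\w* : a word boundary, then p or P, then the rest of the word.
p_words <- str_count(twister, pattern = "\\b[pP]\\w*")
p_words
```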
How might we find the words in a string that contain three or more of the same letter?
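One possible answer, sketched with a back-reference used twice. The match is case-sensitive as written, and the test string is made up.

```r
library(stringr)

text <- "the banana in mississippi was delicious"

# (\\w) captures some character of the word; the two \\1's require it
# to appear at least twice more later in the same word.
# \\w* fills the gaps, and \\b...\\b keeps the match to a whole word.
triple <- "\\b\\w*(\\w)\\w*\\1\\w*\\1\\w*\\b"

triple_words <- str_extract_all(text, pattern = triple)[[1]]
triple_words
```

Note that \\w also matches digits and the underscore, so this sketch treats those as "letters" too.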
In some other languages you see a regex like this:
/regex/gm
Modes are at the end. Popular modes:
g: “global”, looking for all possible matches in the string;
i: “case-insensitive” mode, so that letter-characters in the regex match both their upper and lower-case versions;
m: “multiline” mode, so that the anchors ^ and $ are attached to newlines within the string rather than to the absolute beginning and end of the string;
x: “white-space” mode, where white-spaces in the regex are ignored unless they are escaped (useful for lining out the regex and inserting comments to explain its operation).
Since stringr has _all versions of the main regex functions, we don’t usually have to worry about setting global mode in R.
Use (? ), with the mode letters after the question mark, at the point in the regex where you want the modes to take effect. (Usually at the beginning of the regex.)
Example:
"(?im)t[aeiou]{1,3}$"
At the end of lines in the string we are looking for t (or T) followed by one to three vowels. Uppercase, or lowercase – doesn’t matter.
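A sketch of this regex at work, with and without the inline modes. The three-line text is a made-up example.

```r
library(stringr)

# Hypothetical text with three lines.
text <- "I LIKE TEA\nand a chateau\nat noon"

# With (?im): case-insensitive, and $ matches at the end of every line.
with_modes <- str_extract_all(text, "(?im)t[aeiou]{1,3}$")[[1]]

# Without the modes: $ matches only at the end of the whole string,
# and "noon" does not end in t plus vowels.
without_modes <- str_extract_all(text, "t[aeiou]{1,3}$")[[1]]
```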
The stringr package comes with fruit, a character-vector giving the names of 80 fruits.
In the word “banana” the string “an” appears twice in succession, as does the string “na”.
Find the fruit-names containing at least one string of length two or more that appears twice in succession.
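One possible answer, sketched with str_subset() and a back-reference. The pattern is an assumption about what the exercise intends.

```r
library(stringr)

# (.{2,}) captures any run of two or more characters;
# \\1 requires that same run to follow immediately.
repeaters <- str_subset(fruit, pattern = "(.{2,})\\1")
repeaters
```

Because the dot matches any character, a repeated run may include a space; use \\w in place of the dot to restrict it to word characters.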
Comments With “Ignore Whitespace” Mode
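A sketch of a commented regex, using stringr's regex() helper with comments = TRUE (the equivalent of x mode). White-space and anything after a # are ignored, so white-space that should count must be written as an escape such as \\s. The pattern and test string are illustrative choices.

```r
library(stringr)

same_letter <- regex("
  \\b      # start at a word boundary
  (\\w)    # capture the first letter of the first word
  \\w*     # the rest of the first word
  \\s      # one whitespace character (escaped, so it is not ignored)
  \\1      # the second word starts with the captured letter
  \\w*     # the rest of the second word
  \\b      # end at a word boundary
", comments = TRUE)

pair <- str_extract("the big bad wolf", pattern = same_letter)
pair
```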