Strings

Character Vectors: Strings

Definition

We have met strings before.

String

A sequence of characters


Quotes as Delimiters

We use quotes to mark the beginning and end of a string.

greeting <- "hello"
typeof(greeting)
[1] "character"

Always a Vector

Like all data types in R, strings do not exist on their own. They are always elements of a vector.

is.vector(greeting)
[1] TRUE
length(greeting)
[1] 1

Single or Double Quotes …

… can be used as delimiters. Your choice won’t affect the value of the string:

greeting1 <- "hello"
greeting2 <- 'hello'
greeting1 == greeting2
[1] TRUE

You Can Mix Them …

… in a vector:

politeWords <- c("Please?", 'Thank you!')
politeWords
[1] "Please?"    "Thank you!"

Best Practice in R: Use double-quotes whenever possible.

Characters and Special Characters

A Naive View of Characters

Naively construed, characters are the things you can type on your computer keyboard:

  • the lower-case letters a-z;
  • the upper-case letters A-Z;
  • the digits 0, 1, …, 9;
  • the punctuation characters: ., -, ?, !, ;, :, etc. (and of course the comma, too!)
  • a few other special-use characters: ~, @, #, $, %, _, +, =, and so on;
  • and the space, too!

And Quote-Marks, Too!

You can include quotes in a string. But you have to be careful. For example, how would you get the following in your Console?

## "Welcome", she said, "the coffee's on me!"

This doesn’t work:

cat(""Welcome", she said, "the coffee's on me!"")
Error: <text>:1:7: unexpected symbol
1: cat(""Welcome
          ^

Escaping

You have to escape the special meaning of the quote-marks if you want them to appear inside a string:

cat("\"Welcome\", she said, \"the coffee's on me!\"")
"Welcome", she said, "the coffee's on me!"

Control Characters

The backslash character \ is an example of an escape character. It is not a control character itself, but it lets us write control characters inside a string.

Control Character

A member of a character set that does not represent a written symbol.


The function of \ is to escape the regular meaning of the character immediately following it.
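
To see escaping at work, the sketch below (the variable names are illustrative) builds the same string two ways, once with escaped double quotes and once with single-quote delimiters, and then contrasts print(), which displays the escapes, with cat(), which displays the characters themselves.

```r
# The same string, written with two different delimiters.
quote1 <- "She said, \"Welcome!\""   # double-quote delimiters, escaped quotes
quote2 <- 'She said, "Welcome!"'     # single-quote delimiters, no escapes needed
identical(quote1, quote2)            # TRUE: the escapes are not part of the value

# print() shows the string as you would type it; cat() shows its contents.
print(quote1)
cat(quote1, "\n")

# The backslashes are not stored in the string: it has 20 characters.
nchar(quote1)
```

The single-quote form is convenient only when the string has no apostrophes of its own; otherwise escaping is the more general tool.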

But What If I Want to Print a \?

Suppose you need to write:

The Windows path is:  "C:\\Inetpub\\vhosts\\example.com".

Then you must escape the backslash with another backslash!

cat("The Windows path is:  \"C:\\\\Inetpub\\\\vhosts\\\\example.com\".")
The Windows path is:  "C:\\Inetpub\\vhosts\\example.com".

Other Control Characters

Other control characters can be inserted into a string with the \. For example, we have already met newline:

cat("Old MacDonald had a farm,\nee-i-ee-i-o!")
Old MacDonald had a farm,
ee-i-ee-i-o!

The backslash escapes the ordinary meaning of n, making it stand for the newline control character instead.

More Control Characters

Character   Meaning
\n          newline
\r          carriage return
\t          tab
\b          backspace
\a          alert (bell)
\f          form feed
\v          vertical tab
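
Each escape sequence is typed as two keystrokes but stands for a single character. A minimal check, using base R's nchar() (the variable names are illustrative):

```r
# Four escape sequences, each stored as exactly one character.
escapes <- c("\n", "\t", "\\", "\"")
nchar(escapes)   # 1 1 1 1

# Three letters plus one tab and one newline: five characters in all.
s <- "a\tb\nc"
nchar(s)         # 5
cat(s, "\n")
```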

Try Them Out!

cat("Hell\to")
cat("Hell\ro")

Mostly we will stick with:

  • \
  • \t
  • \n

Unicode

The \ can generate non-control characters, too. For example, it can help form Unicode characters.

Unicode

A computing-industry standard for the consistent encoding of text in most of the world’s written languages.


Examples

cat("\u{2603}")  # the Snowman
cat("Hello\u{202e}there, Friend!")  # the wicked reverser!
Hello‮there, Friend!
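
The hexadecimal digits after \u are the character's Unicode code point. A small sketch with base R's utf8ToInt() and intToUtf8() shows the round trip between the character and its code point (the variable names are illustrative):

```r
snowman <- "\u{2603}"

# However many bytes it occupies, the snowman counts as one character.
nchar(snowman)                    # 1

# Convert the character to its code point and back again.
code_point <- utf8ToInt(snowman)
code_point                        # 9731, which is 2603 in hexadecimal
intToUtf8(code_point) == snowman  # TRUE
```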

Basic String Operations

The stringr Package

stringr comes along with the tidyverse.

We’ll use it a lot for basic manipulation of strings.

String Length

How many characters are in the string "hello"?

str_length("hello")
[1] 5

Note that length() answers a different question. It counts the elements of the vector, not the characters in the string:

length("hello")
[1] 1
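
The difference is easier to see on a vector with several elements. In this sketch the vector words is illustrative, and nchar() is the base-R counterpart of str_length():

```r
library(stringr)

words <- c("hello", "hi", "")

length(words)       # 3: the number of elements in the vector
str_length(words)   # 5 2 0: the number of characters in each element
nchar(words)        # 5 2 0: base R gives the same counts here
```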

Vectorizable

Many basic string operations are vector-in, vector-out:

str_length(c("Mary", "Poppins"))
[1] 4 7

Sub-strings

poppins <- "Supercalifragilisticexpialidocious"
str_sub(poppins, start = 10, end = 20)
[1] "fragilistic"

str_sub to Assign

You can assign a new value to part of a string:

str_sub(poppins, start = 10, end = 20) <- "ABCDEFGHIJK"

Let’s see if that worked:

poppins
[1] "SupercaliABCDEFGHIJKexpialidocious"
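
str_sub() also accepts negative positions, which count backwards from the end of the string. A brief sketch (the variable word is illustrative):

```r
library(stringr)

word <- "Supercalifragilisticexpialidocious"

# -1 is the last character, -7 the seventh from the end.
str_sub(word, start = -7, end = -1)   # "docious"

# end defaults to -1, so this is the same request.
str_sub(word, start = -7)             # "docious"
```

This is handy when you know how many characters you want from the end but not how long the string is.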

Vectorizable

str_sub() is vector-in, vector-out as well:

str_sub(c("Mary", "Poppins"), 1, 3)
[1] "Mar" "Pop"

Trimming

Watch closely:

lastWord <- "farewell\r\n"
str_length(lastWord)
[1] 10
cat(lastWord)
farewell

How to Trim Off the Whitespace?

Use str_trim():

noWhiteSpace  <- str_trim(lastWord)
str_length(noWhiteSpace)
[1] 8
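
By default str_trim() removes whitespace from both ends. Its side argument restricts it to one end, and the related function str_squish() also reduces runs of whitespace inside the string to a single space. A sketch with an illustrative string:

```r
library(stringr)

padded <- "  fare  well  \n"

str_trim(padded)                   # "fare  well"
str_trim(padded, side = "left")    # "fare  well  \n"
str_trim(padded, side = "right")   # "  fare  well"

# str_squish() trims both ends and collapses interior whitespace.
str_squish(padded)                 # "fare well"
```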

Changing Cases

You can make all of the letters in a string lowercase:

str_to_lower("My name is Rhonda.")
[1] "my name is rhonda."

You can make them all uppercase:

str_to_upper("It makes me wanna holler!")
[1] "IT MAKES ME WANNA HOLLER!"
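
A common use of case conversion is comparing strings without regard to case. The sketch below uses two illustrative variables and also shows str_to_title(), a third stringr case function:

```r
library(stringr)

a <- "Rhonda"
b <- "RHONDA"

a == b                                # FALSE: comparison is case-sensitive
str_to_lower(a) == str_to_lower(b)    # TRUE once both are lowercased

# Capitalize the first letter of each word.
str_to_title("my name is rhonda.")    # "My Name Is Rhonda."
```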

Splitting Strings

Consider the following character vector that records several dates:

dates <- c("3-14-1963", "04-01-1965", "12-2-1983")

How can you get access to each element (month, day, year)?

str_split()

str_split() will do the job for you:

str_split(dates, pattern = "-")
[[1]]
[1] "3"    "14"   "1963"

[[2]]
[1] "04"   "01"   "1965"

[[3]]
[1] "12"   "2"    "1983"

The result is a list, with one character vector for each date!

Unlist and Process

dates %>% 
  str_split(pattern = "-") %>% 
  unlist() %>% 
  .[c(1, 4, 7)] %>% 
  as.numeric() %>% 
  month.name[.]
[1] "March"    "April"    "December"
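
Picking out positions 1, 4 and 7 works only because every date has exactly three parts. One alternative, sketched here, is to ask str_split() for a matrix with simplify = TRUE, so that the months sit in the first column (the names parts and months are illustrative):

```r
library(stringr)

dates <- c("3-14-1963", "04-01-1965", "12-2-1983")

# One row per date; columns hold month, day and year as strings.
parts <- str_split(dates, pattern = "-", simplify = TRUE)
dim(parts)                          # 3 3

# Convert the first column to numbers and look up the month names.
months <- as.numeric(parts[, 1])
month.name[months]                  # "March" "April" "December"
```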

Word-by-Word Splitting

Let’s split a string into its words:

"you have won the lottery" %>% 
  str_split(pattern = " ") %>% 
  unlist()
[1] "you"     "have"    "won"     "the"     "lottery"

Beware

Splitting on the space would not have worked if some of the words had been separated by more than one space:

"you have won the  lottery" %>% # two spaces between 'the' and 'lottery'
  str_split(pattern = " ") %>% 
  unlist()
[1] "you"     "have"    "won"     "the"     ""        "lottery"

We’ll address this issue in the next Chapter.
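
Until then, one simple workaround (a sketch, not the approach of the next chapter) is to drop the empty strings after splitting. The variable words is illustrative:

```r
library(stringr)

# Two spaces between 'the' and 'lottery' produce an empty string.
words <- unlist(str_split("you have won the  lottery", pattern = " "))
length(words)           # 6, counting the empty string

# Keep only the non-empty pieces.
words[words != ""]      # "you" "have" "won" "the" "lottery"
```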

Split Into Characters

In order to split a string into its constituent characters, split on the empty string:

"aardvark" %>% 
  str_split(pattern = "") %>% 
  unlist()
[1] "a" "a" "r" "d" "v" "a" "r" "k"

Counting Occurrences of a Given Character

You could use this idea to, say, count the number of occurrences of “a” in a word:

"aardvark" %>% 
  str_split(pattern = "") %>% 
  unlist() %>% 
  .[. == "a"] %>% 
  length()
[1] 3

But …

stringr is way ahead of you, there:

str_count("aardvark", pattern = "a")
[1] 3
str_count("Mississippi", pattern = "ss")
[1] 2
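
Two details are worth checking for yourself: str_count() is vectorized like the other stringr functions, and it counts matches that do not overlap. A short sketch:

```r
library(stringr)

# One count per element of the input vector.
str_count(c("aardvark", "banana", "kiwi"), pattern = "a")   # 3 3 0

# Matches do not overlap: "aaaa" holds two copies of "aa", not three.
str_count("aaaa", pattern = "aa")                           # 2
```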

Joining Strings

The stringr counterpart to paste() is str_c():

str_c("Yabba","dabba","doo!")
[1] "Yabbadabbadoo!"

The default is to separate the arguments with the empty string (unlike paste(), whose default separator is a single space). But you can separate by something else:

str_c("Yabba","dabba","doo!", sep = "-")
[1] "Yabba-dabba-doo!"

Joining the Elements of a Vector

poppins <- c(
  "practically", "perfect", "in",
  "every", "way"
)

This doesn’t join the elements, because str_c() was given only one vector and has nothing to join it to:

str_c(poppins)
[1] "practically" "perfect"     "in"          "every"       "way"        

This does:

str_c(poppins, collapse = " ")
[1] "practically perfect in every way"
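
The sep and collapse arguments can be used together: sep joins the vectors element by element, and collapse then joins the results into one string. A sketch with two illustrative vectors:

```r
library(stringr)

first <- c("Mary", "Bert")
last  <- c("Poppins", "Alfred")

# sep alone: still a vector, one string per pair of elements.
str_c(first, last, sep = " ")
# "Mary Poppins" "Bert Alfred"

# sep and collapse: a single string.
str_c(first, last, sep = " ", collapse = " and ")
# "Mary Poppins and Bert Alfred"
```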