Introduction to Regular Expressions (regex) in R

If you haven’t used regular expressions (regex) before, they are basically a way to write search patterns for strings. I’ve always found them to be inscrutable and unintuitive, so even though the search pattern I have in mind is usually very simple and should, in theory, require only a basic regex, I always have to Google what the correct syntax is. Today, I’m going to try to solve this problem by writing my own guide to regex.

R commands

First, let’s go over the R functions that you are likely going to use regex with:

grep: Search for matches of the regex in a vector and return the indices of the elements that match if value = FALSE (the default), or the elements themselves if value = TRUE
grepl: Search for matches of the regex in a vector and return a logical vector
sub: Replace the first match of the regex with a new string
gsub: Replace all matches of the regex with a new string
regexpr: Return the starting position (-1 if none) and length of the first match
gregexpr: Return the starting positions (-1 if none) and lengths of all matches
regmatches: Extract the matched substrings located by regexpr or gregexpr. If invert = TRUE, extract the substrings that don’t match the regex instead.
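To make the differences between these functions concrete, here is a minimal sketch that runs each of them with the same pattern. The fruits vector and the variable names are made up for this illustration only; they are not the example vector used in the rest of the post.

```r
# A small, made-up vector used only for this illustration
fruits <- c("apple", "banana", "cherry")

grep("an", fruits)                 # index of the matching element: 2
grep("an", fruits, value = TRUE)   # the matching element itself: "banana"
grepl("an", fruits)                # FALSE TRUE FALSE

sub("an", "AN", fruits)            # first match only: "bANana"
gsub("an", "AN", fruits)           # every match: "bANANa"

# regexpr() finds the first match per element, gregexpr() finds all of them
first_match <- regexpr("an", fruits)
all_matches <- gregexpr("an", fruits)

regmatches(fruits, first_match)    # "an" (only elements with a match are kept)
regmatches(fruits, all_matches)    # a list with one character vector per element
```

Note that regexpr returns a plain vector while gregexpr returns a list, because an element can contain any number of matches.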

Examples

To demonstrate how the different regular expressions work, I will use the following example vector.

library(magrittr)  # provides the %>% pipe (dplyr loads it as well)
example_vec = c("The big ocean", "The big Ocean", "The big ocean.", "The big. ocean!", "The big.. ocean!")
example_vec %>% cat(., sep = "\n")

## The big ocean
## The big Ocean
## The big ocean.
## The big. ocean!
## The big.. ocean!

Anchors

^: Start of string
$: End of string

# Find elements that end with "ocean"
grep("ocean$", example_vec, value = TRUE) %>% cat(., sep = "\n")

## The big ocean
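The ^ anchor works the same way at the other end of the string. The following is a small sketch in base R (no pipe); it redefines the example vector so it can be run on its own, and the variable names starts_with_the and exact are just placeholders I picked for the illustration.

```r
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# Every element starts with "The", so all five elements are returned
starts_with_the <- grep("^The", example_vec, value = TRUE)
print(starts_with_the)

# Using "^" and "$" together forces the pattern to span the whole string
exact <- grep("^The big ocean$", example_vec, value = TRUE)
print(exact)
```

Combining both anchors is a handy way to test for an exact match rather than a substring.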

More Characters

.: Match any character
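[a-z], [0-9], [Tt], [[:punct:]], [[:alpha:]], [[:lower:]], [[:upper:]], [[:digit:]]: Match any one of the characters listed in the brackets (i.e. “character classes”)
\\: Escape a character.¹ For example, to search for a literal period, you need to use \\. (a bare . would match any character).
|: “OR” operator
^: “NOT” operator when it is the first character inside brackets (e.g. [^O] matches any character except O)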

# Find elements that end with "ocean" plus any character
grep("ocean.$", example_vec, value = TRUE) %>% cat(., sep = "\n")

## The big ocean.
## The big. ocean!
## The big.. ocean!

# Find elements that end with "ocean" or "Ocean"
grep("[Oo]cean$", example_vec, value = TRUE) %>% cat(., sep = "\n")

## The big ocean
## The big Ocean

# Find elements that end with a punctuation character
grep("[[:punct:]]$", example_vec, value = TRUE) %>% cat(., sep = "\n")

## The big ocean.
## The big. ocean!
## The big.. ocean!

# Find elements that end with a period
grep("\\.$", example_vec, value = TRUE) %>% cat(., sep = "\n")

## The big ocean.
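Since escaping is the part I trip over most, here is a short sketch comparing an unescaped period with three ways of asking for a literal one. It uses only base R; the variable names (unescaped, escaped, bracketed, literal) are ones I made up for this comparison.

```r
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# In R source, "\\." is a two-character string: a backslash and a period
nchar("\\.")

# An unescaped "." matches any character, so every element matches
unescaped <- grepl(".", example_vec)

# Three ways to require a literal period
escaped   <- grepl("\\.", example_vec)               # escape with a double backslash
bracketed <- grepl("[.]", example_vec)               # put it in a character class
literal   <- grepl(".", example_vec, fixed = TRUE)   # turn off regex matching entirely

print(escaped)
```

The fixed = TRUE option is worth remembering when the search string is not meant to be a regex at all.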

# Find elements that end with "ocean" or "Ocean"
grep("(ocean$|Ocean$)", example_vec, value = TRUE) %>% cat(., sep = "\n")

## The big ocean
## The big Ocean

# Find elements that end with "cean" but not "Ocean"
grep("[^O]cean$", example_vec, value = TRUE) %>% cat(., sep = "\n")

## The big ocean

Repetition

*: Repeat 0 or more times
+: Repeat 1 or more times
{2}: Repeat exactly 2 times
{2,4}: Repeat 2 to 4 times (no space after the comma)

# Find elements that start with "The big."
grep("^The big\\.+", example_vec, value = TRUE) %>% cat(., sep = "\n")

## The big. ocean!
## The big.. ocean!
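The counted forms {2} and {2,4} work like + but with explicit limits. Below is a rough sketch in base R; the runs vector is made up purely to show the {2,4} range, and the anchors are there so that the whole string has to fit the count.

```r
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# Exactly two periods between "big" and the following space
exactly_two <- grep("big\\.{2} ", example_vec, value = TRUE)
print(exactly_two)

# A made-up vector of repeated letters to illustrate a range
runs <- c("a", "aa", "aaa", "aaaaa")

# Keep only the strings made of two to four "a" characters
two_to_four <- grep("^a{2,4}$", runs, value = TRUE)
print(two_to_four)
```

Without the ^ and $ anchors, "aaaaa" would also match, because it contains a run of two to four "a" characters as a substring.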

# Replace substring beginning with the first period, if it exists
gsub("\\.(.*)$", "REPLACEMENT", example_vec) %>% cat(., sep = "\n")

## The big ocean
## The big Ocean
## The big oceanREPLACEMENT
## The bigREPLACEMENT
## The bigREPLACEMENT

# Replace substring beginning with the last period, if it exists
gsub("\\.[^\\.]*$", "REPLACEMENT", example_vec) %>% cat(., sep = "\n")

## The big ocean
## The big Ocean
## The big oceanREPLACEMENT
## The bigREPLACEMENT
## The big.REPLACEMENT
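Both examples above use gsub, but because each pattern runs to the end of the string there is at most one match per element, so sub would give the same result. To see where the two functions actually differ, here is a minimal sketch on a single string (the variable names are mine, chosen for the illustration).

```r
x <- "The big.. ocean!"

# sub() replaces only the first period
first_only <- sub("\\.", "!", x)
print(first_only)

# gsub() replaces every period
every_match <- gsub("\\.", "!", x)
print(every_match)
```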

# Find matched substrings beginning with the first period
regmatches(example_vec, regexpr("\\.(.*)$", example_vec)) %>% cat(., sep = "\n")

## .
## . ocean!
## .. ocean!

# Find the substrings that don't match "big.."
regmatches(example_vec, regexpr("big(\\.{2})", example_vec), invert = TRUE)

## [[1]]
## [1] "The big ocean"
## 
## [[2]]
## [1] "The big Ocean"
## 
## [[3]]
## [1] "The big ocean."
## 
## [[4]]
## [1] "The big. ocean!"
## 
## [[5]]
## [1] "The "    " ocean!"
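The examples so far pair regmatches with regexpr, which only ever finds the first match. As a sketch of the gregexpr variant, the code below extracts every match per element; it uses base R only, and the names all_periods and words are placeholders of my own.

```r
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# One list entry per element, holding every period found in it
all_periods <- regmatches(example_vec, gregexpr("\\.", example_vec))

# Number of periods in each element
lengths(all_periods)

# Pull out every run of letters (i.e. the words) from the last element
words <- regmatches(example_vec[5], gregexpr("[[:alpha:]]+", example_vec[5]))[[1]]
print(words)
```

Because the result is a list, elements with no match show up as empty character vectors rather than being dropped.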

Resources

Here are some resources and references I used while writing this blog post.

You can test or look up your regex at https://regex101.com/#python
RStudio cheat sheet
JHU Data Science lecture on regex

¹ For more details about why a double backslash is necessary, see https://stackoverflow.com/questions/27721008/how-do-i-deal-with-special-characters-like-in-my-regex.