Introduction to Regular Expressions (regex) in R
Jan 08, 2020

If you haven’t used regular expressions (regex) before, they are basically a way to write search patterns for strings. I’ve always found them to be inscrutable and unintuitive, so even though the search pattern I have in mind is usually very simple and should, in theory, require only a basic regex, I always have to Google what the correct syntax is. Today, I’m going to try to solve this problem by writing my own guide to regex.1

R commands

First, let’s go over the R functions that you are likely going to use regex with:

  • grep: Search for matches of the regex in a vector and return the indices of the matching elements if value = FALSE (the default), or the elements themselves if value = TRUE
  • grepl: Search for matches of the regex in a vector and return a logical vector
  • sub: Replace the first match of the regex with a new string
  • gsub: Replace all matches of the regex with a new string
  • regexpr: Return starting position (-1 if none) and length of first match
  • gregexpr: Return starting positions (-1 if none) and lengths of all matches
  • regmatches: Extract the matched substrings located by regexpr or gregexpr. If invert = TRUE, extract the substrings that don’t match the regex instead.
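
As a quick side-by-side of the functions listed above, here is a minimal sketch in base R. The vector x and the result names (idx, hits, and so on) are made up for this illustration and are not part of the examples that follow.

```r
# Hypothetical input vector, used only for this illustration
x <- c("apple pie", "banana", "cherry pie")

idx   <- grep("pie", x)                 # indices of matching elements: 1 3
hits  <- grep("pie", x, value = TRUE)   # the matching elements themselves
flags <- grepl("pie", x)                # TRUE FALSE TRUE

first_only  <- sub("a", "A", x)         # replaces only the first "a" in each element
all_of_them <- gsub("a", "A", x)        # replaces every "a" in each element

first_pos   <- regexpr("an", x)         # start of the first match per element: -1 2 -1
first_match <- regmatches(x, first_pos) # "an" (only elements with a match are returned)

all_pos     <- gregexpr("an", x)        # a list with one entry per element
all_matches <- regmatches(x, all_pos)   # list; the entry for "banana" is "an" "an"
```

Note that regexpr returns a plain vector while gregexpr returns a list, so regmatches returns a character vector in the first case and a list in the second.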

Examples

To demonstrate how the different regular expressions work, I will use the following example vector.

library(magrittr)  # provides the %>% pipe used throughout this post
example_vec = c("The big ocean", "The big Ocean", "The big ocean.", "The big. ocean!", "The big.. ocean!")
example_vec %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big ocean.
## The big. ocean!
## The big.. ocean!

Anchors

  • ^: Start of string
  • $: End of string
# Find elements that end with "ocean"
grep("ocean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
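
The example above only exercises $. As a small sketch of the other anchor, the block below uses ^ on its own and then combines both anchors to require a match on the whole string. It redefines the example vector so it runs on its own; the names starts_with_the and exact are made up for this illustration.

```r
# Same example vector as in the post, repeated so this block is self-contained
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# ^ anchors the pattern to the start of the string
starts_with_the <- grepl("^The", example_vec)   # TRUE for every element

# Using ^ and $ together forces the pattern to match the entire string
exact <- grep("^The big ocean$", example_vec, value = TRUE)
# only "The big ocean" qualifies; "The big ocean." has a trailing period
```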

More Characters

  • .: Match any character
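  • [a-z], [0-9], [Tt], [[:punct:]], [[:alpha:]], [[:lower:]], [[:upper:]], [[:digit:]]: Match any one of the characters listed in the brackets (i.e. “character classes”)
  • \\: Escape a character.2 For example, to search for a literal period, you need to use \\. because an unescaped . matches any character.
  • |: “OR” operator
  • ^: “NOT” operator when it is the first character inside brackets, e.g. [^O] matches any single character except “O”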
# Find elements that end with "ocean" plus any character
grep("ocean.$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
## The big. ocean!
## The big.. ocean!
# Find elements that end with "ocean" or "Ocean"
grep("[Oo]cean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
# Find elements that end with a punctuation character
grep("[[:punct:]]$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
## The big. ocean!
## The big.. ocean!
# Find elements that end with a period
grep("\\.$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
# Find elements that end with "ocean" or "Ocean"
grep("(ocean$|Ocean$)", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
# Find elements that end with "cean" preceded by any character other than "O"
grep("[^O]cean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
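
Two points in this section are easy to trip over: an unescaped period matches anything, and ^ means different things inside and outside brackets. The sketch below illustrates both in base R. The variable names are made up for this illustration, and fixed = TRUE is shown as an alternative to escaping, not as something the examples above rely on.

```r
# Same example vector as in the post, repeated so this block is self-contained
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# An unescaped "." matches any character, so every non-empty string matches
any_char <- grepl(".", example_vec)

# Escaping it restricts the match to a literal period
literal_escaped <- grepl("\\.", example_vec)

# fixed = TRUE treats the whole pattern as a plain string, so no escaping is needed
literal_fixed <- grepl(".", example_vec, fixed = TRUE)

# Outside brackets ^ is the start-of-string anchor; inside brackets it negates.
# This asks for strings whose first character is anything other than "T".
not_starting_with_T <- grepl("^[^T]", example_vec)
```

Here any_char is TRUE for all five elements, the two literal-period versions are TRUE only for the last three, and not_starting_with_T is FALSE everywhere because every element starts with "T".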

Repetition

  • *: Repeat 0 or more times
  • +: Repeat 1 or more times
  • {2}: Repeat exactly 2 times
  • {2,4}: Repeat 2 to 4 times (write the bounds without a space after the comma)
# Find elements that start with "The big" followed by one or more periods
grep("^The big\\.+", example_vec, value = T) %>% cat(., sep = "\n")
## The big. ocean!
## The big.. ocean!
# Replace substring beginning with the first period, if it exists
gsub("\\.(.*)$", "REPLACEMENT", example_vec) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big oceanREPLACEMENT
## The bigREPLACEMENT
## The bigREPLACEMENT
# Replace substring beginning with the last period, if it exists
gsub("\\.[^\\.]*$", "REPLACEMENT", example_vec) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big oceanREPLACEMENT
## The bigREPLACEMENT
## The big.REPLACEMENT
# Find matched substrings beginning with the first period
regmatches(example_vec, regexpr("\\.(.*)$", example_vec)) %>% cat(., sep = "\n")
## .
## . ocean!
## .. ocean!
# Find the substrings that don't match "big.."
regmatches(example_vec, regexpr("big(\\.{2})", example_vec), invert = T)
## [[1]]
## [1] "The big ocean"
## 
## [[2]]
## [1] "The big Ocean"
## 
## [[3]]
## [1] "The big ocean."
## 
## [[4]]
## [1] "The big. ocean!"
## 
## [[5]]
## [1] "The "    " ocean!"
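
The examples above use + and * and a single {2}. As a rough sketch of the remaining quantifier forms, the block below tries *, {2}, and a {min,max} range on the same vector, and uses gregexpr to count every period in each element. The variable names are made up for this illustration.

```r
# Same example vector as in the post, repeated so this block is self-contained
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# *: zero or more periods between "big" and " ocean" (lowercase "o" required)
zero_or_more <- grepl("big\\.* ocean", example_vec)

# {2}: exactly two periods in a row somewhere in the string
exactly_two <- grepl("\\.{2}", example_vec)

# {1,2}: one or two periods between "big" and the following space
one_to_two <- grepl("big\\.{1,2} ", example_vec)

# gregexpr finds every match, so regmatches returns a list of all periods
all_periods  <- regmatches(example_vec, gregexpr("\\.", example_vec))
period_count <- lengths(all_periods)   # number of periods per element
```

With this vector, zero_or_more is FALSE only for "The big Ocean", exactly_two is TRUE only for the last element, one_to_two is TRUE for the last two, and period_count is 0 0 1 1 2.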

Resources

Here are some resources and references I used while writing this blog post.


  1. As I continue to familiarize myself with regex, I will update this post.↩︎

  2. For more details about why a double backslash is necessary, see https://stackoverflow.com/questions/27721008/how-do-i-deal-with-special-characters-like-in-my-regex.↩︎

