Introduction to Regular Expressions (regex) in R
If you haven’t used regular expressions (regex) before, they are basically a way to write search patterns for strings. I’ve always found them to be inscrutable and unintuitive, so even though the search pattern I have in mind is usually very simple and should, in theory, require only a basic regex, I always have to Google what the correct syntax is. Today, I’m going to try to solve this problem by writing my own guide to regex.
R commands
First, let’s go over the R functions that you are likely going to use regex with:
grep
: Search for matches of the regex in a vector and return the indices of the matching elements if value = FALSE (the default), or the elements themselves if value = TRUE
grepl
: Search for matches of the regex in a vector and return a logical vector
sub
: Replace the first match of the regex with a new string
gsub
: Replace all matches of the regex with a new string
regexpr
: Return the starting position (-1 if none) and length of the first match
gregexpr
: Return the starting positions (-1 if none) and lengths of all matches
regmatches
: Extract the matched substrings obtained by regexpr or gregexpr. If invert = TRUE, extract the substrings that don’t match the regex.
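As a quick side-by-side of the functions above, here is a minimal sketch in base R. The vector `fruits` and all the result names are made up for this illustration and are not used elsewhere in the post.

```r
# Hypothetical vector, used only for this illustration
fruits <- c("apple pie", "banana", "cherry pie")

idx      <- grep("pie", fruits)                # indices of matching elements
matched  <- grep("pie", fruits, value = TRUE)  # the matching elements themselves
is_match <- grepl("pie", fruits)               # one TRUE/FALSE per element

first_only <- sub("a", "A", fruits)    # replaces only the first "a" in each element
every_one  <- gsub("a", "A", fruits)   # replaces every "a" in each element

first_pos <- regexpr("an", fruits)     # start of the first match per element, -1 if none
all_pos   <- gregexpr("an", fruits)    # starts of all matches per element

first_sub <- regmatches(fruits, first_pos)  # one substring per element that matched
all_sub   <- regmatches(fruits, all_pos)    # a list with one character vector per element
```

Note that regmatches combined with regexpr silently drops the elements that had no match, while the gregexpr version keeps one (possibly empty) list entry per element.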
Examples
To demonstrate how the different regular expressions work, I will use the following example vector.
library(magrittr)  # provides the %>% pipe used in the examples below
example_vec = c("The big ocean", "The big Ocean", "The big ocean.", "The big. ocean!", "The big.. ocean!")
example_vec %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big ocean.
## The big. ocean!
## The big.. ocean!
Anchors
^
: Start of string
$
: End of string
# Find elements that end with "ocean"
grep("ocean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
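The example above only exercises `$`, so here is a small sketch of `^`, alone and combined with `$`. The vector `phrases` is a hypothetical one made up for this illustration.

```r
# Hypothetical vector, used only for this illustration
phrases <- c("ocean view", "the ocean", "ocean")

# "^" anchors the pattern to the start of the string
starts_with <- grep("^ocean", phrases, value = TRUE)

# Using both anchors requires the whole string to match the pattern
exact <- grep("^ocean$", phrases, value = TRUE)

print(starts_with)
print(exact)
```

Here `starts_with` should contain "ocean view" and "ocean", while `exact` should contain only "ocean".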
More Characters
.
: Match any character
[a-z], [0-9], [Tt], [[:punct:]], [[:alpha:]], [[:lower:]], [[:upper:]], [[:digit:]]
: Match any one of the options in the brackets (i.e. “character classes”)
\\
: Escape a character (see the footnote at the end of this post). For example, to search for a period, you need to use \\.
|
: “OR” operator
^
: “NOT” operator when it is the first character inside brackets, e.g. [^O] matches any single character except “O”
# Find elements that end with "ocean" plus any character
grep("ocean.$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
## The big. ocean!
## The big.. ocean!
# Find elements that end with "ocean" or "Ocean"
grep("[Oo]cean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
# Find elements that end with a punctuation character
grep("[[:punct:]]$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
## The big. ocean!
## The big.. ocean!
# Find elements that end with a period
grep("\\.$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
# Find elements that end with "ocean" or "Ocean"
grep("(ocean$|Ocean$)", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
# Find elements that end with "cean" but not "Ocean"
grep("[^O]cean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
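The example vector has no digits, so the digit and letter classes from the list above are not shown. The following is a minimal sketch using a hypothetical vector `labels_vec` that I made up for this purpose.

```r
# Hypothetical vector, used only for this illustration
labels_vec <- c("room 101", "Room B", "lobby", "exit 7!")

# Elements containing at least one digit anywhere
has_digit <- grep("[0-9]", labels_vec, value = TRUE)

# Elements that start with an uppercase letter
starts_upper <- grep("^[[:upper:]]", labels_vec, value = TRUE)

# Elements that end with a digit
ends_digit <- grep("[[:digit:]]$", labels_vec, value = TRUE)

# "^" inside the brackets negates the class: drop everything that is not a letter
letters_only <- gsub("[^[:alpha:]]", "", labels_vec)
```

In the last pattern the outer brackets form the character class and `^` negates it, so `letters_only` should come out as "room", "RoomB", "lobby" and "exit".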
Repetition
*
: Repeat 0 or more times
+
: Repeat 1 or more times
{2}
: Repeat exactly 2 times
{2,4}
: Repeat 2 to 4 times (no space after the comma)
# Find elements that start with "The big."
grep("^The big\\.+", example_vec, value = T) %>% cat(., sep = "\n")
## The big. ocean!
## The big.. ocean!
# Replace substring beginning with the first period, if it exists
gsub("\\.(.*)$", "REPLACEMENT", example_vec) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big oceanREPLACEMENT
## The bigREPLACEMENT
## The bigREPLACEMENT
# Replace substring beginning with the last period, if it exists
gsub("\\.[^\\.]*$", "REPLACEMENT", example_vec) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big oceanREPLACEMENT
## The bigREPLACEMENT
## The big.REPLACEMENT
# Find matched substrings beginning with the first period
regmatches(example_vec, regexpr("\\.(.*)$", example_vec)) %>% cat(., sep = "\n")
## .
## . ocean!
## .. ocean!
# Find the substrings that don't match "big.."
regmatches(example_vec, regexpr("big(\\.{2})", example_vec), invert = T)
## [[1]]
## [1] "The big ocean"
##
## [[2]]
## [1] "The big Ocean"
##
## [[3]]
## [1] "The big ocean."
##
## [[4]]
## [1] "The big. ocean!"
##
## [[5]]
## [1] "The " " ocean!"
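The examples above mostly use `+` and `*`, so here is a short sketch that puts all four repetition operators next to each other. The vector `codes` is hypothetical and made up for this illustration.

```r
# Hypothetical vector, used only for this illustration
codes <- c("a1", "a12", "a123", "a12345", "a")

# "a" followed by exactly two digits, and nothing else
exactly_two <- grep("^a[0-9]{2}$", codes, value = TRUE)

# "a" followed by two to four digits, and nothing else
two_to_four <- grep("^a[0-9]{2,4}$", codes, value = TRUE)

# "a" followed by zero or more digits: also matches the bare "a"
zero_or_more <- grep("^a[0-9]*$", codes, value = TRUE)

# "a" followed by one or more digits: excludes the bare "a"
one_or_more <- grep("^a[0-9]+$", codes, value = TRUE)
```

The anchors matter here: without `^` and `$`, a pattern like `[0-9]{2}` would also match inside "a12345", because any two consecutive digits satisfy it.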
Resources
Here are some resources and references I used while writing this blog post.
- You can test or look up your regex at https://regex101.com/#python
- RStudio cheat sheet
- JHU Data Science lecture on regex
Footnote: For more details about why a double backslash is necessary, see https://stackoverflow.com/questions/27721008/how-do-i-deal-with-special-characters-like-in-my-regex.