If you haven’t used regular expressions (regex) before, they are basically a way to write search patterns for strings. I’ve always found them to be inscrutable and unintuitive, so even though the search pattern I have in mind is usually very simple and should, in theory, require only a basic regex, I always have to Google what the correct syntax is. Today, I’m going to try to solve this problem by writing my own guide to regex.
R commands
First, let’s go over the R functions that you are likely going to use regex with:
grep: Searches for matches of the regex in a vector and returns the indices of the matching elements if value = FALSE (the default), or the elements themselves if value = TRUE
grepl: Searches for matches of the regex in a vector and returns a logical vector
sub: Replaces the first match of the regex in each element with a new string
gsub: Replaces all matches of the regex with a new string
regexpr: Returns the starting position (-1 if none) and length of the first match in each element
gregexpr: Returns the starting positions (-1 if none) and lengths of all matches in each element
regmatches: Extracts the matched substrings found by regexpr or gregexpr. If invert = TRUE, extracts the substrings that don’t match the regex.
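As a quick sketch of how these functions differ, here is each one applied to the same small vector. The vector x and the pattern "an" are made up for illustration and are not part of the examples used in the rest of this post.

```r
# Hypothetical input vector, used only for this illustration
x <- c("banana", "apple", "mango")

idx      <- grep("an", x)                # indices of matching elements: 1 3
vals     <- grep("an", x, value = TRUE)  # matching elements: "banana" "mango"
flags    <- grepl("an", x)               # TRUE FALSE TRUE
first    <- sub("an", "AN", x)           # first match only: "bANana" "apple" "mANgo"
all_repl <- gsub("an", "AN", x)          # every match: "bANANa" "apple" "mANgo"

m <- regexpr("an", x)                    # start of first match per element: 2 -1 2
first_matches <- regmatches(x, m)        # "an" "an" (non-matching elements are dropped)

g <- gregexpr("an", x)                   # list of all match starts per element
all_matches <- regmatches(x, g)          # list: c("an", "an"), character(0), "an"
```

Note that regmatches returns a plain vector when given regexpr output, but a list with one entry per input element when given gregexpr output.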
Examples
To demonstrate how the different regular expressions work, I will use the following example vector.
library(magrittr)  # provides the %>% pipe used in the examples
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.", "The big. ocean!", "The big.. ocean!")
example_vec %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big ocean.
## The big. ocean!
## The big.. ocean!
Anchors
^: Start of string
$: End of string
# Find elements that end with "ocean"
grep("ocean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
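The example above only uses $, so here is a small sketch of the ^ anchor and of both anchors together. It uses base R only and repeats the example vector so that it runs on its own.

```r
# Same vector as in the post, repeated so the snippet is self-contained
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# ^ anchors the pattern to the start: every element starts with "The"
starts_with_the <- grepl("^The", example_vec)

# Using ^ and $ together forces the whole string to match the pattern
exact <- grep("^The big ocean$", example_vec, value = TRUE)
```

Here starts_with_the is TRUE for all five elements, while exact keeps only "The big ocean".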
More Characters
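.: Match any single character
[a-z], [0-9], [Tt], [[:punct:]], [[:alpha:]], [[:lower:]], [[:upper:]], [[:digit:]]: Match any one of the characters listed in the brackets (i.e. “character classes”)
\\: Escape a character. For example, to search for a literal period, you need to use \\. in the pattern
|: “OR” operator
^: “NOT” operator when it is the first character inside brackets, e.g. [^O] matches any single character except “O” (outside brackets, ^ is the start-of-string anchor)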
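Because ^ means two different things depending on where it appears, a small side-by-side sketch may help. The vector x below is hypothetical and chosen only to make the difference visible.

```r
# Hypothetical vector for illustration
x <- c("ocean", "Ocean", "cean")

# Outside brackets, ^ is the start-of-string anchor
anchor <- grepl("^O", x)          # FALSE TRUE FALSE

# As the first character inside brackets, ^ negates the class:
# one character that is not "O", followed by "cean"
negated <- grepl("[^O]cean", x)   # TRUE FALSE FALSE

# Both uses combined: the first character is anything except "O"
both <- grepl("^[^O]", x)         # TRUE FALSE TRUE
```

Note that [^O] still has to match a character, which is why "cean" on its own does not match "[^O]cean".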
# Find elements that end with "ocean" plus any character
grep("ocean.$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
## The big. ocean!
## The big.. ocean!
# Find elements that end with "ocean" or "Ocean"
grep("[Oo]cean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
# Find elements that end with a punctuation character
grep("[[:punct:]]$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
## The big. ocean!
## The big.. ocean!
# Find elements that end with a period
grep("\\.$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
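To see why the escape matters, it can help to compare the escaped and unescaped period directly. This is a minimal sketch in base R; the bracket form [.] is an alternative way to match a literal period, since a period inside brackets is not special.

```r
# Same vector as in the post, repeated so the snippet is self-contained
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# Unescaped, "." matches any character, so every element matches
any_char <- grep(".$", example_vec)           # 1 2 3 4 5

# Escaped, only a literal period at the end matches
escaped <- grep("\\.$", example_vec)          # 3

# Inside brackets the period is literal, so no escape is needed
bracketed <- grep("[.]$", example_vec)        # 3

# Two literal periods in a row
double_dot <- grep("\\.\\.", example_vec)     # 5
```

The backslash is doubled because R itself treats a single backslash in a string as an escape, so "\\." is what delivers \. to the regex engine.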
# Find elements that end with "ocean" or "Ocean"
grep("(ocean$|Ocean$)", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
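The parentheses matter with |, because the alternation otherwise splits the entire pattern. The following sketch, in base R, contrasts a grouped alternation with an ungrouped one on the same example vector.

```r
# Same vector as in the post, repeated so the snippet is self-contained
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# Group only the part that varies; $ applies to both alternatives
grouped <- grep("(o|O)cean$", example_vec, value = TRUE)

# Without grouping this reads as: "ocean" anywhere OR "Ocean" at the end
ungrouped <- grep("ocean|Ocean$", example_vec, value = TRUE)
```

Here grouped contains "The big ocean" and "The big Ocean", while ungrouped returns all five elements, since four of them contain "ocean" somewhere and the remaining one ends with "Ocean".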
# Find elements that end with "cean" preceded by any character other than "O"
grep("[^O]cean$", example_vec, value = T) %>% cat(., sep = "\n")