If you haven’t used regular expressions (regex) before, they are basically a way to write search patterns for strings. I’ve always found them to be inscrutable and unintuitive, so even though the search pattern I have in mind is usually very simple and should, in theory, require only a basic regex, I always have to Google what the correct syntax is. Today, I’m going to try to solve this problem by writing my own guide to regex.

## R commands

First, let’s go over the R functions that you are likely going to use regex with:

• grep: Searches for matches of the regex in a vector and returns the indices of the matching elements if value = F, or the elements themselves if value = T
• grepl: Searches for matches of the regex in a vector and returns a logical vector
• sub: Replaces the first match of the regex with a new string
• gsub: Replaces all matches of the regex with a new string
• regexpr: Returns the starting position (-1 if none) and length of the first match in each element
• gregexpr: Returns the starting positions (-1 if none) and lengths of all matches in each element
• regmatches: Extracts the matched substrings found by regexpr or gregexpr. If invert = T, extracts the substrings that don’t match the regex.
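
To make the differences between these functions concrete, here is a minimal sketch that runs each of them on a small toy vector. The vector `fruits` and the variable names are made up for illustration and are not used elsewhere in this post.

```r
# Hypothetical toy vector, chosen only for illustration
fruits <- c("apple pie", "banana", "cherry pie")

grep("pie", fruits)                # indices of matching elements: 1 3
grep("pie", fruits, value = TRUE)  # the matching elements themselves
grepl("pie", fruits)               # TRUE FALSE TRUE

sub("a", "A", fruits)              # first "a" only: "Apple pie" "bAnana" "cherry pie"
gsub("a", "A", fruits)             # every "a": "Apple pie" "bAnAnA" "cherry pie"

first_match <- regexpr("an", "banana")   # start 2, match.length 2
all_matches <- gregexpr("an", "banana")  # a list; its first element has starts 2 and 4

regmatches("banana", first_match)  # "an"
regmatches("banana", all_matches)  # a list containing c("an", "an")
```

Note that gregexpr returns a list with one entry per input string, so regmatches returns a list as well when it is given a gregexpr result.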

## Examples

To demonstrate how the different regular expressions work, I will use the following example vector.

library(magrittr)  # provides the %>% pipe used in the examples below
example_vec = c("The big ocean", "The big Ocean", "The big ocean.", "The big. ocean!", "The big.. ocean!")
example_vec %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big ocean.
## The big. ocean!
## The big.. ocean!

### Anchors

• ^: Start of string
• $: End of string

# Find elements that end with "ocean"
grep("ocean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
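
The example above only exercises `$`, so here is a small sketch of `^` as well, and of using both anchors together. It redefines the same example vector so that it can be run on its own.

```r
example_vec <- c("The big ocean", "The big Ocean", "The big ocean.",
                 "The big. ocean!", "The big.. ocean!")

# Without an anchor, "big" can match anywhere in the string
grepl("big", example_vec)    # all TRUE

# With ^, the match has to begin at the first character
grepl("^big", example_vec)   # all FALSE, since every element starts with "The"

# Using both anchors forces the pattern to cover the entire string
grep("^The big ocean$", example_vec, value = TRUE)   # "The big ocean"
```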

### More Characters

• .: Match any character
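• [a-z], [0-9], [Tt], [[:punct:]], [[:alpha:]], [[:lower:]], [[:upper:]], [[:digit:]]: Match any one of the characters listed in the brackets (i.e. “character classes”)
• \\: Escape a character. For example, to search for a literal period, use \\. (the backslash is doubled because R strings also treat \ as an escape character).
• |: “OR” operator
• ^: “NOT” operator when it is the first character inside brackets, e.g. [^O] matches any character except “O”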
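
Since `.` and `^` each mean two different things depending on context, a side-by-side comparison may help. This is a minimal sketch on made-up inputs; the vector `x` is hypothetical and only serves to show the contrast.

```r
# Hypothetical inputs for illustration
x <- c("abc", "a.c", "a1c")

grepl("a.c", x)       # unescaped "." matches any character: TRUE TRUE TRUE
grepl("a\\.c", x)     # escaped "." matches only a literal period: FALSE TRUE FALSE

grepl("a[0-9]c", x)   # a digit in the middle: FALSE FALSE TRUE
grepl("a[^0-9]c", x)  # anything except a digit in the middle: TRUE TRUE FALSE

grepl("abc|a1c", x)   # either alternative: TRUE FALSE TRUE

# Outside brackets ^ is an anchor; inside brackets it negates.
# This pattern reads: "the first character is not an a".
grepl("^[^a]", c("abc", "xbc"))   # FALSE TRUE
```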
# Find elements that end with "ocean" plus any character
grep("ocean.$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
## The big. ocean!
## The big.. ocean!
# Find elements that end with "ocean" or "Ocean"
grep("[Oo]cean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
# Find elements that end with a punctuation character
grep("[[:punct:]]$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
## The big. ocean!
## The big.. ocean!
# Find elements that end with a period
grep("\\.$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean.
# Find elements that end with "ocean" or "Ocean"
grep("(ocean$|Ocean$)", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
# Find elements that end with "cean" but not "Ocean"
grep("[^O]cean$", example_vec, value = T) %>% cat(., sep = "\n")
## The big ocean

### Repetition

• *: Repeat 0 or more times
• +: Repeat 1 or more times
• {2}: Repeat exactly 2 times
• {2,4}: Repeat 2 to 4 times

# Find elements that start with "The big" followed by one or more periods
grep("^The big\\.+", example_vec, value = T) %>% cat(., sep = "\n")
## The big. ocean!
## The big.. ocean!
# Replace substring beginning with the first period, if it exists
gsub("\\.(.*)$", "REPLACEMENT", example_vec) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big oceanREPLACEMENT
## The bigREPLACEMENT
## The bigREPLACEMENT
# Replace substring beginning with the last period, if it exists
gsub("\\.[^\\.]*$", "REPLACEMENT", example_vec) %>% cat(., sep = "\n")
## The big ocean
## The big Ocean
## The big oceanREPLACEMENT
## The bigREPLACEMENT
## The big.REPLACEMENT
# Find matched substrings beginning with the first period
regmatches(example_vec, regexpr("\\.(.*)$", example_vec)) %>% cat(., sep = "\n")
## .
## . ocean!
## .. ocean!
# Find the substrings that don't match "big.."
regmatches(example_vec, regexpr("big(\\.{2})", example_vec), invert = T)
## [[1]]
## [1] "The big ocean"
##
## [[2]]
## [1] "The big Ocean"
##
## [[3]]
## [1] "The big ocean."
##
## [[4]]
## [1] "The big. ocean!"
##
## [[5]]
## [1] "The "    " ocean!"
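
The repetition examples above lean on `*` and `+`, so here is a short sketch of the counted forms as well, plus gregexpr for pulling out every match rather than just the first. The inputs are made up for illustration. Note that the counted range is written without a space, as `{2,4}`.

```r
# Hypothetical inputs for illustration
a_runs <- c("a", "aa", "aaa", "aaaaa")

grepl("^a{2}$", a_runs)     # exactly two: FALSE TRUE FALSE FALSE
grepl("^a{2,4}$", a_runs)   # two to four: FALSE TRUE TRUE FALSE
grepl("^a+$", a_runs)       # one or more: all TRUE

# * also allows zero repetitions, so a bare "b" still matches
grepl("^ba*$", c("b", "baa", "ab"))   # TRUE TRUE FALSE

# gregexpr finds every run of periods, not just the first one
s <- "The big.. ocean."
regmatches(s, gregexpr("\\.+", s))[[1]]   # ".." "."
```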

## Resources

Here are some resources and references I used while writing this blog post.