5 R Packages to Simplify Your Data Science Workflow
I just finished my second year in the PhD program, which means 2 years of writing a lot of R code. Today, I wanted to share some useful (and perhaps lesser known) R packages that I use.
- pacman for loading packages
This package contains an awesome function called p_load. I prefer the concise way it lets you load packages, as opposed to writing library(package) over and over again. Moreover, p_load automatically checks if you have installed the package and if you haven’t, installs it for you from CRAN. While this is the only function from pacman that I consistently use, the package has a couple more intuitively named functions for managing packages (e.g. p_update) that I can also see being useful, if only I could remember they exist.
library(pacman)
p_load(tidyverse, janitor, skimr, broom, here)
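To make concrete what p_load saves you from writing, here is a rough base-R sketch of the check, install, and load pattern. The helper name load_or_install is made up for illustration, and this is not how pacman is implemented internally; it only mimics the behavior described above.

```r
# Hypothetical helper (not part of pacman): install a package from CRAN
# if it is missing, then attach it.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Base packages are used here so the example runs without installing anything.
load_or_install(c("stats", "utils"))

# With pacman, the same pattern is a single call with unquoted names:
# pacman::p_load(tidyverse, janitor, skimr, broom, here)
```

The pacman version is shorter mostly because it accepts unquoted package names and handles the install step for you.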
- janitor for cleaning data
This is a great package for cleaning and exploring data. Whenever I read in a dataset, I almost always use clean_names, which turns the dataset’s column names into the “snake_case” format, though there are options for other naming preferences. This ensures that all the columns are named in the same format, preventing future typos and headaches.
The package offers a couple of additional functions that I also use from time to time. For finding duplicate entries, you can use get_dupes, which returns the duplicated rows along with a count rather than dropping them, and for removing empty rows or columns, you can use remove_empty. The package also includes a pipe-able version of table.
dt_raw = read_csv("test.csv")
dt = dt_raw %>% clean_names()
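Since test.csv is not available here, the following sketch runs the janitor functions mentioned above on a small made-up data frame. The data and variable names are invented for illustration; tabyl is janitor's pipe-able counterpart to table.

```r
library(janitor)

# Made-up data with messy names, a duplicated row, an empty row,
# and an empty column
raw <- data.frame(
  `First Name`   = c("Ana", "Ben", "Ben", NA),
  `Total Amount` = c(10, 20, 20, NA),
  `Empty Col`    = NA,
  check.names = FALSE
)

# Column names become first_name, total_amount, empty_col
clean <- clean_names(raw)

# Drop rows and columns that contain only NA
clean <- remove_empty(clean, which = c("rows", "cols"))

# Return the duplicated rows with a dupe_count column (nothing is removed)
dupes <- get_dupes(clean, first_name, total_amount)

# Frequency table of first_name, returned as a data frame
counts <- tabyl(clean, first_name)
```

Because each of these functions takes a data frame as its first argument, they can also be chained with %>% in the same way as clean_names above.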
- skimr for summary statistics
This package has a function called
skim that’s basically a better version of
summary. It gives you a tally of missing observations and plots a mini histogram of each numeric variable, which I think is super neat.
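A call along the following lines would produce a summary like the one that follows. This assumes the built-in iris dataset, which the 150 rows, 5 columns, and Species factor in the output suggest; the variable name iris_summary is only for illustration.

```r
library(skimr)

# iris ships with base R: 150 rows, 4 numeric columns, and 1 factor column
iris_summary <- skim(iris)

# Printing shows one table per variable type
iris_summary
```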
|Number of rows|150|
|Number of columns|5|

Column type frequency:

Variable type: factor

|skim_variable|n_missing|complete_rate|ordered|n_unique|top_counts|
|---|---|---|---|---|---|
|Species|0|1|FALSE|3|set: 50, ver: 50, vir: 50|

Variable type: numeric
- broom for working with model output
Model output from functions like lm is not that convenient to work with by default, which is what broom fixes. The function tidy turns the model output into a tibble, and the function augment adds relevant things like fitted values and residuals as columns to your data.
model_lm = lm(Sepal.Width ~ Petal.Length, iris)
tidy(model_lm)
## # A tibble: 2 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     3.45     0.0761     45.4  9.02e-89
## 2 Petal.Length   -0.106    0.0183     -5.77 4.51e- 8

## # A tibble: 6 x 11
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species .fitted .resid
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>  <dbl>
## 1          5.1         3.5          1.4         0.2 setosa     3.31  0.193
## 2          4.9         3            1.4         0.2 setosa     3.31 -0.307
## 3          4.7         3.2          1.3         0.2 setosa     3.32 -0.117
## 4          4.6         3.1          1.5         0.2 setosa     3.30 -0.196
## 5          5           3.6          1.4         0.2 setosa     3.31  0.293
## 6          5.4         3.9          1.7         0.4 setosa     3.28  0.625
## # … with 4 more variables: .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## #   .std.resid <dbl>
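The call that produced the six-row tibble is not shown. A sketch that yields output of that shape, assuming it came from augment on the same model followed by head, might look like this (the name model_aug is only for illustration, and the exact set of diagnostic columns can vary with the broom version):

```r
library(broom)

model_lm <- lm(Sepal.Width ~ Petal.Length, data = iris)

# One row per observation, with .fitted, .resid, and other
# diagnostics appended to the original columns
model_aug <- augment(model_lm, data = iris)

# Show the first six rows
head(model_aug)
```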
- here for paths
By default, R often starts with your home directory as the working directory (this is what RStudio does outside of a project). This is generally not ideal, because the files you want to read and write most likely live in a project folder, not your home directory. One way around this is to call setwd at the top of your R script, e.g. setwd("/Users/myusername/myfolder/2019/project_a/"). But this can be problematic when you share your code, since your collaborator will have to change that line to match their directory structure. It will also break if you rename any of the parent directories, e.g. “2019” or “project_a.”
here solves this by automatically identifying which directory makes the most sense as the root directory, for example, by finding a .git directory. Most of my code is either in a git repository or part of an R project or both, so this solution works well for me. As a quick example, you can write the file path either as a single string relative to the project root or as separate components, as in here("data", "sample_data.csv"). I’ve usually found the former to be the more convenient syntax, since it’s more compatible with autocomplete, but there may be cases where the latter is also useful.
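As a small sketch of the two ways of writing a path with here, assuming a project that contains a data folder (the variable names are only for illustration, and the file does not need to exist for the paths to be built):

```r
library(here)

# Single string, relative to the project root
path_string <- here("data/sample_data.csv")

# Separate components, joined by here()
path_parts <- here("data", "sample_data.csv")

# Both forms should resolve to the same absolute path
identical(path_string, path_parts)

# Typical use, if the file existed:
# dt_raw <- readr::read_csv(here("data", "sample_data.csv"))
```

Either form gives an absolute path built from the detected project root, so the script does not depend on the current working directory.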
The package is perhaps less necessary for R Markdown files because, by default, an R Markdown file uses the directory of the file itself as the base directory. That’s especially true if I’m working on an individual one-off Rmd file, since I may not have a .git directory in place. But that happens pretty rarely these days. If I’m working on an Rmd file within an R project, I still prefer to use here to be consistent.