5 R Packages to Simplify Your Data Science Workflow

I just finished my second year in the PhD program, which means 2 years of writing a lot of R code. Today, I wanted to share some useful (and perhaps lesser known) R packages that I use.

  1. pacman for loading packages

This package contains an awesome function called p_load. I prefer the concise way it lets you load packages, as opposed to writing library(package) over and over again. Moreover, p_load automatically checks whether you have installed a package and, if you haven’t, installs it for you from CRAN. While this is the only function from pacman that I consistently use, the package has a couple more intuitively named functions for managing packages (e.g. p_load_gh, p_unload, p_update) that I can also see being useful, if only I could remember they exist.

library(pacman)
p_load(tidyverse, janitor, skimr, broom, here)
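
As a quick sketch of those other functions (the GitHub repository below is just a placeholder, not a real package):

```r
library(pacman)

# Load a package hosted on GitHub, installing it first if needed
# ("username/pkgname" is a hypothetical repository)
p_load_gh("username/pkgname")

# Unload a package from the current session without uninstalling it
p_unload(janitor)

# Update out-of-date CRAN packages
p_update()
```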

  2. janitor for cleaning data

This is a great package for cleaning and exploring data. Every time I read in a dataset, I almost always use clean_names, which turns the dataset’s column names into the “snake_case” format, though there are options for other naming preferences. This ensures that all the columns are named in the same format, preventing future typos and headaches.

The package offers a couple of additional functions that I also use from time to time. For identifying duplicate entries, you can use get_dupes, which returns the duplicated rows for inspection rather than silently removing them. For removing empty rows or columns, you can use remove_empty, and for a pipe-able version of table, you can use tabyl.

dt_raw = read_csv("test.csv")
dt = dt_raw %>% clean_names()
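
Here’s a small sketch of those helpers on a toy data frame (the data is made up for illustration):

```r
library(pacman)
p_load(tidyverse, janitor)

# A toy data frame with one duplicated row and one all-NA column
dt = tibble(
  name  = c("a", "b", "b", "c"),
  group = c("x", "y", "y", "x"),
  blank = c(NA, NA, NA, NA)
)

# get_dupes returns the duplicated rows (plus a dupe_count column)
dt %>% get_dupes(name, group)

# remove_empty drops rows and/or columns that are entirely NA
dt %>% remove_empty("cols")

# tabyl gives a pipe-friendly frequency table of a variable
dt %>% tabyl(group)
```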

  3. skimr for summary statistics

This package has a function called skim that’s basically a better version of summary. It gives you a tally of missing observations and plots a mini histogram of each numeric variable, which I think is super neat.

data(iris)
skim(iris)
Table 1: Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal.Length 0 1 5.84 0.83 4.3 5.1 5.80 6.4 7.9 ▆▇▇▅▂
Sepal.Width 0 1 3.06 0.44 2.0 2.8 3.00 3.3 4.4 ▁▆▇▂▁
Petal.Length 0 1 3.76 1.77 1.0 1.6 4.35 5.1 6.9 ▇▁▆▇▂
Petal.Width 0 1 1.20 0.76 0.1 0.3 1.30 1.8 2.5 ▇▁▇▅▃

  4. broom for working with model output

Model output from functions like lm is not that convenient to work with by default, which is what broom fixes. The function tidy turns the model output into a tibble, and the function augment adds relevant things like fitted values and residuals as columns to your data.

model_lm = lm(Sepal.Width ~ Petal.Length, iris)
tidy(model_lm)
## # A tibble: 2 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     3.45     0.0761     45.4  9.02e-89
## 2 Petal.Length   -0.106    0.0183     -5.77 4.51e- 8
head(augment(model_lm, iris))
## # A tibble: 6 x 11
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species .fitted .resid
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>  <dbl>
## 1          5.1         3.5          1.4         0.2 setosa     3.31  0.193
## 2          4.9         3            1.4         0.2 setosa     3.31 -0.307
## 3          4.7         3.2          1.3         0.2 setosa     3.32 -0.117
## 4          4.6         3.1          1.5         0.2 setosa     3.30 -0.196
## 5          5           3.6          1.4         0.2 setosa     3.31  0.293
## 6          5.4         3.9          1.7         0.4 setosa     3.28  0.625
## # … with 4 more variables: .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## #   .std.resid <dbl>

  5. here for paths

By default, an R session often uses the user home directory as the working directory. This is generally not ideal, because the files you want to read and write most likely live in some project folder, not your home directory. One way around this is to call setwd at the top of your R script, e.g. setwd("/Users/myusername/myfolder/2019/project_a/"). But this is problematic when you share your code, since your collaborator will have to change that line to match their own directory structure. And it will break if you rename any of the parent directories, e.g. “2019” or “project_a.”

The function here solves this by automatically identifying which directory makes the most sense as the root directory, for example, by finding an .Rproj or .git file. Most of my code is either in a git repository or part of an R project or both, so this solution works well for me. As a quick example, you can write the file path as here("./data/sample_data.csv") or here("data", "sample_data.csv"). I’ve usually found the former to be the more convenient syntax, since it’s more compatible with autocomplete, but there may be cases where the latter is also useful.
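
In a script, that looks something like this (the file paths are hypothetical):

```r
library(pacman)
p_load(tidyverse, here)

# here() builds an absolute path from the project root, so this works
# no matter which subdirectory the script is run from
dt = read_csv(here("data", "sample_data.csv"))

# Writing output works the same way
write_csv(dt, here("output", "results.csv"))
```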

The package is perhaps less necessary for R Markdown files, because by default an R Markdown file uses its own directory as the base directory. This matters most when I’m working on a one-off Rmd file, since I may not have an .Rproj or .git file in place, but that happens pretty rarely these days. If I’m working on an Rmd file within an R project, I still prefer to use here for consistency.