A Few of My Favorite R Packages
Jun 28, 2019

I just finished my second year in the PhD program, which means 2 years of writing a lot of R code. Today, I wanted to share some useful (and perhaps lesser known) R packages that I use.

1. pacman for loading packages:

This package contains an awesome function called p_load. I prefer the concise way it lets you load packages, as opposed to writing library(package) over and over again. Moreover, p_load automatically checks whether you have a package installed and, if you haven't, installs it for you from CRAN. While this is the only function from pacman that I consistently use, the package has a few more intuitively named functions for managing packages (e.g. p_load_gh, p_unload, p_update) that I can also see being useful, if only I could remember they exist.

library(pacman)
p_load(tidyverse, janitor, skimr, broom, here)
2. janitor for cleaning data:

This is a great package for cleaning and exploring data. Whenever I read in a dataset, I almost always use clean_names, which converts the dataset’s column names to “snake_case”, though there are options for other naming conventions. This ensures that all the columns are named in a consistent format, preventing future typos and headaches.

The package offers a couple of additional functions that I also use from time to time. For identifying duplicate entries, you can use get_dupes, and for removing empty rows or columns, you can use remove_empty. For a pipe-able version of table, you can use tabyl.
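To illustrate (with a made-up toy data frame): get_dupes returns the rows that are duplicated, along with a dupe_count column, and tabyl slots nicely into the end of a pipe.

```r
library(janitor)
library(dplyr)

# a small toy data frame (made up for illustration)
df = tibble(
  name = c("a", "a", "b", "c"),
  grp  = c("x", "x", "y", "y")
)

df %>% get_dupes(name, grp)  # the duplicated rows, plus a dupe_count column
df %>% tabyl(grp)            # frequency table of grp, returned as a data frame
```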

dt_raw = read_csv("test.csv")
dt = dt_raw %>% clean_names()
3. skimr for summary statistics:

This package has a function called skim that’s basically a better version of summary. It gives you a tally of missing observations and plots a mini histogram of each numeric variable, which I think is super neat.

data(iris)
skim(iris)
## Skim summary statistics
##  n obs: 150
##  n variables: 5
##
## ── Variable type:factor ─────────────────────────────────────────────────
##  variable missing complete   n n_unique                       top_counts
##   Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0
##  ordered
##    FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────────
##      variable missing complete   n mean   sd  p0 p25  p50 p75 p100
##  Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9
##   Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5
##  Sepal.Length       0      150 150 5.84 0.83 4.3 5.1 5.8  6.4  7.9
##   Sepal.Width       0      150 150 3.06 0.44 2   2.8 3    3.3  4.4
##      hist
##  ▇▁▁▂▅▅▃▁
##  ▇▁▁▅▃▃▂▂
##  ▂▇▅▇▆▅▂▂
##  ▁▂▅▇▃▂▁▁
4. broom for working with model output:

Model output from functions like lm is not that convenient to work with by default, which is what broom fixes. The function tidy turns the model output into a tibble, and the function augment adds relevant quantities like fitted values and residuals as columns to your data.

model_lm = lm(Sepal.Width ~ Petal.Length, iris)
tidy(model_lm)
## # A tibble: 2 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     3.45     0.0761     45.4  9.02e-89
## 2 Petal.Length   -0.106    0.0183     -5.77 4.51e- 8
head(augment(model_lm, iris))
## # A tibble: 6 x 12
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species .fitted .se.fit
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>   <dbl>
## 1          5.1         3.5          1.4         0.2 setosa     3.31  0.0540
## 2          4.9         3            1.4         0.2 setosa     3.31  0.0540
## 3          4.7         3.2          1.3         0.2 setosa     3.32  0.0554
## 4          4.6         3.1          1.5         0.2 setosa     3.30  0.0525
## 5          5           3.6          1.4         0.2 setosa     3.31  0.0540
## 6          5.4         3.9          1.7         0.4 setosa     3.28  0.0497
## # … with 5 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>
5. here for paths:

By default, R uses the user home directory as the working directory. This is generally not ideal, because the files you want to read and write likely live in some project folder, not your home directory. One way around this is to call setwd at the top of your R script, e.g. setwd("/Users/myusername/myfolder/2019/project_a/"). But this is problematic when you share your code, since your collaborator will have to change that line to match their directory structure, and it will break if you rename any parent directory, e.g. “2019” or “project_a.” The function here solves this by automatically identifying which directory makes the most sense as the root directory, for example, by finding an .Rproj or .git file. Most of my code is in a git repository, an R project, or both, so this solution works well for me.
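A minimal sketch of how this looks in practice (the data/test.csv path is made up): calling here with no arguments returns the project root, and passing path components builds an absolute path relative to that root.

```r
library(here)

here()                    # the project root, e.g. the folder containing the .Rproj file
here("data", "test.csv")  # an absolute path to data/test.csv under the project root

# so reading a file becomes portable across machines:
# dt = read_csv(here("data", "test.csv"))
```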

I’ve found that here is not necessary for R Markdown files, though, since by default an R Markdown file uses its own directory as the base directory. Since I generally prefer to write my analyses and reports in R Markdown, I don’t use here quite as much as you might expect.