5 R Packages to Simplify Your Data Science Workflow
I just finished my second year in the PhD program, which means 2 years of writing a lot of R code. Today, I wanted to share some useful (and perhaps lesser known) R packages that I use.
- pacman for loading packages
This package contains an awesome function called p_load. I prefer the concise way it lets you load packages, as opposed to writing library(package) over and over again. Moreover, p_load automatically checks whether you have a package installed and, if you haven’t, installs it for you from CRAN. While this is the only function from pacman that I consistently use, the package has a couple more intuitively named functions for managing packages (e.g. p_load_gh, p_unload, p_update) that I can also see being useful, if only I could remember they exist.
library(pacman)
p_load(tidyverse, janitor, skimr, broom, here)
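And as a quick sketch of those other functions, in case it helps them stick (using r-lib/conflicted as an arbitrary example of a GitHub package):

p_load_gh("r-lib/conflicted")  # install/load a package straight from GitHub
p_unload(janitor)              # unload a package from the current session
p_update()                     # update out-of-date CRAN packages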
- janitor for cleaning data
This is a great package for cleaning and exploring data. Every time I read in a dataset, I almost always use clean_names, which turns the dataset’s column names into the “snake_case” format, though there are options for other naming preferences. This ensures that all the columns are named in a consistent format, preventing future typos and headaches.
The package offers a couple of additional functions that I also use from time to time. For flagging duplicate entries, you can use get_dupes, and for removing empty rows or columns, you can use remove_empty. For a pipe-able version of table, you can use tabyl. A short sketch of all three follows the example below.
dt_raw = read_csv("test.csv")
dt = dt_raw %>% clean_names()
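Here is a hedged sketch of those other three functions, using iris so it’s self-contained (the column choices are purely for illustration):

iris %>% get_dupes(Species, Sepal.Length)  # rows sharing the same species and sepal length
iris %>% remove_empty(c("rows", "cols"))   # drop all-NA rows/columns (a no-op on iris)
iris %>% tabyl(Species)                    # pipe-able frequency table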
- skimr for summary statistics
This package has a function called skim that’s basically a better version of summary. It gives you a tally of missing observations and plots a mini histogram of each numeric variable, which I think is super neat.
data(iris)
skim(iris)
Data summary | |
---|---|
Name | iris |
Number of rows | 150 |
Number of columns | 5 |
Column type frequency: | |
factor | 1 |
numeric | 4 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Sepal.Length | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 | ▆▇▇▅▂ |
Sepal.Width | 0 | 1 | 3.06 | 0.44 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 | ▁▆▇▂▁ |
Petal.Length | 0 | 1 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▆▇▂ |
Petal.Width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 | ▇▁▇▅▃ |
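skim also cooperates with tidyverse verbs, so you can restrict it to a few columns or compute the summaries by group (a quick sketch):

skim(iris, Sepal.Length, Sepal.Width)  # skim only selected columns

iris %>%
  group_by(Species) %>%  # summary statistics within each species
  skim()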
- broom for working with model output
Model output from functions like lm is not that convenient to work with by default, which is what broom fixes. The function tidy turns the model output into a tibble, and the function augment adds relevant things like fitted values and residuals as columns to your data.
model_lm = lm(Sepal.Width ~ Petal.Length, iris)
tidy(model_lm)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.45 0.0761 45.4 9.02e-89
## 2 Petal.Length -0.106 0.0183 -5.77 4.51e- 8
head(augment(model_lm, iris))
## # A tibble: 6 x 11
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species .fitted .resid
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 5.1 3.5 1.4 0.2 setosa 3.31 0.193
## 2 4.9 3 1.4 0.2 setosa 3.31 -0.307
## 3 4.7 3.2 1.3 0.2 setosa 3.32 -0.117
## 4 4.6 3.1 1.5 0.2 setosa 3.30 -0.196
## 5 5 3.6 1.4 0.2 setosa 3.31 0.293
## 6 5.4 3.9 1.7 0.4 setosa 3.28 0.625
## # … with 4 more variables: .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## # .std.resid <dbl>
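Since tidy returns a regular tibble, it composes with the rest of the tidyverse; for example, you can request confidence intervals or pull out a single term (a small sketch):

tidy(model_lm, conf.int = TRUE)                    # add 95% confidence intervals
tidy(model_lm) %>% filter(term == "Petal.Length")  # keep just the slope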
- here for paths
By default, R uses the directory you launched it from, often your home directory, as the working directory. This is generally not ideal, because the files you want to read and write most likely live in some project folder, not your home directory. One way to get around this is to call setwd at the top of your R script, e.g. setwd("/Users/myusername/myfolder/2019/project_a/"). But this is problematic when you share your code, since your collaborator will have to change that line to match their directory structure. And it will break if you rename any of the parent directories, e.g. “2019” or “project_a.”
The function here solves this by automatically identifying which directory makes the most sense as the root directory, for example by looking for an .Rproj or .git file. Most of my code is in a git repository, part of an R project, or both, so this solution works well for me. As a quick example, you can write the file path as here("./data/sample_data.csv") or here("data", "sample_data.csv"). I’ve usually found the former more convenient, since it plays better with autocomplete, but there may be cases where the latter is also useful.
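As a minimal sketch (assuming a data/sample_data.csv inside the project root):

# resolves to <project root>/data/sample_data.csv, no matter where the code runs from
dt = read_csv(here("data", "sample_data.csv"))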
The package is perhaps less necessary for R Markdown files, because by default an R Markdown file uses its own directory as the base directory. That matters mostly when I’m working on an individual one-off Rmd file, since I may not have an .Rproj or .git file in place, but that happens pretty rarely these days. If I’m working on an Rmd file within an R project, I still prefer to use here for consistency.