I just finished my second year in the PhD program, which means two years of writing a lot of R code. Today, I want to share some useful (and perhaps lesser known) R packages that I use.
pacman for loading packages:
This package contains an awesome function called p_load. I prefer the concise way it lets you load packages, as opposed to writing library(package) over and over again. Moreover, p_load automatically checks if you have installed the package and if you haven’t, installs it for you from CRAN. While this is the only function from pacman that I consistently use, the package has a couple more intuitively named functions for managing packages (e.g. p_update) that I can also see being useful, if only I could remember they exist.
library(pacman)
p_load(tidyverse, janitor, skimr, broom, here)
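For anyone curious what p_load is saving you from, here is a rough base R sketch of the install-if-missing-then-load pattern. The helper name load_or_install is made up for illustration and is not part of pacman, and the real p_load does more than this.

```r
# Hypothetical helper (not part of pacman): install each package
# if it is missing, then attach it.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE when the package is not installed
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # character.only = TRUE lets library() take the name as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# These two ship with R, so nothing gets installed here
load_or_install(c("stats", "utils"))
```

Writing this out once makes it clear why a single p_load call is nicer to have at the top of a script.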
janitor for cleaning data:
This is a great package for cleaning and exploring data. Every time I read in a dataset, I almost always use clean_names, which turns the dataset’s column names into the “snake_case” format, though there are options for other naming preferences. This ensures that all the columns are named in the same format, preventing future typos and headaches.
The package offers a couple of additional functions that I also use from time to time. For finding duplicate entries, you can use get_dupes, which returns the duplicated rows so you can decide what to do with them. For removing empty rows or columns, you can use remove_empty. For a pipe-able version of table, you can use tabyl.
dt_raw = read_csv("test.csv")
dt = dt_raw %>% clean_names()
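To make the other janitor functions a bit more concrete, here is a small sketch on a made-up data frame. The data and the column names are invented for illustration; the function calls themselves are from janitor.

```r
library(janitor)

# Hypothetical toy data: messy column names, one duplicated row, one empty row
raw <- data.frame(
  "First Name"  = c("Ann", "Bob", "Ann", NA),
  "Total Score" = c(90, 85, 90, NA),
  check.names = FALSE
)

dt <- clean_names(raw)                  # columns become first_name, total_score
dt <- remove_empty(dt, which = "rows")  # drops the row where everything is NA

# get_dupes() only reports duplicates (with a dupe_count column);
# it does not remove anything
dupes <- get_dupes(dt)

# Base R unique() is one way to actually drop the repeated rows
dt_unique <- unique(dt)

# Counts and percentages for one column
tabyl(dt_unique, first_name)
```

Note that get_dupes is for inspection, so dropping the duplicates is a separate step.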
skimr for summary statistics:
This package has a function called skim that’s basically a better version of summary. It gives you a tally of missing observations and plots a mini histogram of each numeric variable, which I think is super neat.
## Skim summary statistics
##  n obs: 150
##  n variables: 5
##
## ── Variable type:factor ─────────────────────────────────────────────────
##  variable missing complete   n n_unique                       top_counts
##   Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0
##  ordered
##    FALSE
##
## ── Variable type:numeric ────────────────────────────────────────────────
##      variable missing complete   n mean   sd  p0 p25  p50 p75 p100
##  Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9
##   Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5
##  Sepal.Length       0      150 150 5.84 0.83 4.3 5.1 5.8  6.4  7.9
##   Sepal.Width       0      150 150 3.06 0.44 2   2.8 3    3.3  4.4
##      hist
##  ▇▁▁▂▅▅▃▁
##  ▇▁▁▅▃▃▂▂
##  ▂▇▅▇▆▅▂▂
##  ▁▂▅▇▃▂▁▁
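The summary above looks like what you get from running skim on R’s built-in iris data. A minimal sketch, assuming that dataset (newer versions of skimr may print the result in a different layout):

```r
library(skimr)

# iris ships with R: 150 rows, four numeric columns and one factor
iris_summary <- skim(iris)

# Printing the result shows the per-variable summary
iris_summary
```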
broom for working with model output:
Model output from functions like lm is not that convenient to work with by default, which is what broom fixes. The function tidy turns the model output into a tibble, and the function augment adds relevant things like fitted values and residuals as columns to your data.
model_lm = lm(Sepal.Width ~ Petal.Length, iris)
tidy(model_lm)
## # A tibble: 2 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     3.45     0.0761     45.4  9.02e-89
## 2 Petal.Length   -0.106    0.0183     -5.77 4.51e- 8
## # A tibble: 6 x 12
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species .fitted .se.fit
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>   <dbl>
## 1          5.1         3.5          1.4         0.2 setosa     3.31  0.0540
## 2          4.9         3            1.4         0.2 setosa     3.31  0.0540
## 3          4.7         3.2          1.3         0.2 setosa     3.32  0.0554
## 4          4.6         3.1          1.5         0.2 setosa     3.30  0.0525
## 5          5           3.6          1.4         0.2 setosa     3.31  0.0540
## 6          5.4         3.9          1.7         0.4 setosa     3.28  0.0497
## # … with 5 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>
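The second table above looks like the first few rows of augment output for the same model. A minimal sketch of how you might produce something similar, assuming the model from before (the exact set of columns may depend on your broom version):

```r
library(broom)

# Same model as above, refit here so the snippet runs on its own
model_lm <- lm(Sepal.Width ~ Petal.Length, data = iris)

# augment() returns the model data with extra columns such as
# .fitted and .resid appended
model_aug <- augment(model_lm)

# Peek at the first six rows
head(model_aug)
```

Because the result is a regular data frame, it can go straight into plotting or further wrangling.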
here for file paths:
By default, R uses the user home directory as the working directory. This is generally not ideal, because it’s likely that all the files you want to read and write are in some project folder, not your home directory. One way around this is to use
setwd at the top of your R script, i.e. setwd("/Users/myusername/myfolder/2019/project_a/"). But this can be problematic when you share your code, since it means your collaborator will have to change that line of code to match their directory structure. And it will break if you change any of your parent directory names, e.g. “2019” or “project_a”. The function here solves this by automatically identifying which directory makes most sense as the root directory, for example, by finding a .git directory. Most of my code is either in a git repository or part of an R project or both, so this solution works well for me.
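As a rough illustration, here is how a file path might be built relative to the project root. The data folder and the file name are hypothetical, so adjust them to your own layout.

```r
library(here)

# Hypothetical layout: <project root>/data/test.csv
# here() joins its arguments onto the detected project root
data_path <- here("data", "test.csv")
data_path

# The path can then be passed to a reader, e.g.
# dt_raw <- readr::read_csv(data_path)
```

Since the root is detected rather than hard-coded, the same line should work on a collaborator’s machine without edits.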
I’ve found that here is not necessary for R Markdown files though, since by default, an R Markdown file uses the directory of the file itself as the base directory. Since I prefer to write my analyses and reports in R Markdown in general, I don’t use here quite as much as you might expect.