Getting Started with R Markdown

I often use R Markdown for my research projects or any kind of data analysis (if you’re familiar with Python, R Markdown documents are similar to Jupyter notebooks). There are many advantages to using R Markdown over writing plain R scripts. One of the major ones is how easily it lets me turn my work into something presentable for my advisor or other collaborators. With R Markdown, I don’t have to track down a bunch of plots and files or do any additional work to organize them. It also keeps the code right there on the page, so I can quickly check exactly what I did when asked. People seem to like how my R Markdown documents look and have asked me how I generate them, so I thought this would be a good topic for my blog.

I’m not going to write a tutorial about how to use R Markdown because there are already a ton of resources for that. Here are a couple, in increasing order of detail:

The default R Markdown template tells you how to use it and is enough to get you started
Jacolien van Rij’s tutorial, which is still brief but walks you through some more detail
R Markdown cheat sheet, also accessible through RStudio -> Help -> Cheatsheets
Yihui Xie’s definitive guide

But I will walk through some of the things that I typically do (or don’t do) when I work in R Markdown. First, my go-to YAML header (which controls the overall appearance of the document) looks like the following:

---
title: "Title"
author: "Albert Kuo"
date: "10/22/2019"
output: 
  html_document:
    code_folding: "hide"
    toc: TRUE
    toc_float: TRUE
---

html_document is the format of the output document¹
code_folding keeps all the code chunks in the document but hides them by default
toc adds a table of contents
toc_float floats the table of contents so that it’s always accessible when you’re scrolling through the document, making it easy to navigate between sections

R Markdown also lets you set a variety of options for each code chunk. For example, I frequently set fig.show = "hold" and out.width = "50%". Displaying plots side by side isn’t usually intuitive in R, but these two options do it for the plots generated by a code chunk: fig.show = "hold" holds all the plots until the end of the chunk, and out.width = "50%" scales each one to half the page width.²
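As a minimal sketch of those two options in use (the chunk label and the two plots of the built-in mtcars data are placeholders I picked for illustration), a chunk might look like this:

````markdown
```{r side-by-side, fig.show = "hold", out.width = "50%"}
# Two base R plots from the built-in mtcars data set.
# fig.show = "hold" delays both plots until the end of the chunk, and
# out.width = "50%" sizes each to half the page width, so in the HTML
# output they should appear next to each other.
plot(mtcars$wt, mtcars$mpg, main = "mpg vs. weight")
hist(mtcars$mpg, main = "Distribution of mpg")
```
````

With more than two plots in the chunk, the same options should simply wrap them into rows of two.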

On the other hand, a common code chunk option that I actively avoid is cache = TRUE. This option saves (or caches) the results of a code chunk and doesn’t rerun it unless you change something in that code chunk. In theory, this is good for when you have something that takes a long time to run, such as fitting a model. But in practice, I’ve found that this easily leads to unintentional reproducibility issues. For example, I might change the input data (or other things upstream), but the cached code chunk won’t rerun the model, resulting in an output model that still corresponds to the old data.³
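To make the failure mode concrete, here is a rough sketch of the kind of setup where it can bite. The file name data.csv, the columns x and y, and the chunk labels are all hypothetical:

````markdown
```{r load-data}
# Not cached: this chunk reruns on every knit and picks up the latest file.
dat <- read.csv("data.csv")  # hypothetical input file
```

```{r fit-model, cache = TRUE}
# Cached: as long as the code and options of this chunk stay the same,
# knitr reuses the saved result instead of rerunning it, even if
# data.csv (and therefore dat) has changed since the last knit.
fit <- lm(y ~ x, data = dat)  # x and y are hypothetical column names
```
````

Nothing in the second chunk changes when the data does, so the document would knit without complaint while fit still reflects the old file.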

There are a lot of options in R Markdown and I still don’t know many of them. But if you use R, I encourage you to spend a bit of time learning how to use R Markdown because it’s not difficult to get started and there’s a good chance it’ll make your life easier. As you get more familiar, you can fine-tune and learn about things that work for you and things that don’t.

¹ I would recommend HTML over the other supported formats because it’s the one that’s most tightly integrated with R Markdown.
² However, just today I discovered the plot_grid function from the cowplot R package, which may be a better option. It looks easy to use and has the added benefit of automatically labelling your plots while arranging them.
³ Update (12/15/19): There are ways around this, as Yihui Xie has described, but I still think caching leads too easily to hidden errors.