Getting Started with R Markdown
I often use R Markdown for my research projects or any kind of data analysis (if you’re familiar with Python, R Markdown documents are similar to Jupyter notebooks). There are many advantages to using R Markdown over writing plain R scripts. One of the major ones is how easily it lets me turn my work into something presentable for my advisor or other collaborators. With R Markdown, I don’t have to track down a bunch of plots and files or do any additional work to organize them. It also keeps the code right there on the page, so that I can quickly check exactly what I did when asked. People seem to like how my R Markdown documents look and have asked me how I generated them, so I thought this would be a good thing to talk about on my blog.
I’m not going to write a tutorial about how to use R Markdown because there are already a ton of resources for that. Here are a few, in increasing order of detail:
- The default R Markdown template tells you how to use it and is enough to get you started
- Jacolien van Rij’s tutorial, which is still brief but walks you through some more detail
- R Markdown cheat sheet, also accessible through RStudio -> Help -> Cheatsheets
- Yihui Xie’s definitive guide
But I will walk through some of the things that I typically do (or don’t do) when I work in R Markdown. First, my go-to YAML header (which controls the overall appearance of the document) looks like the following:
---
title: "Title"
author: "Albert Kuo"
date: "10/22/2019"
output:
  html_document:
    code_folding: "hide"
    toc: TRUE
    toc_float: TRUE
---
- html_document is the format of the output document. [1]
- code_folding keeps all the code chunks in the document but hides them by default.
- toc adds a table of contents.
- toc_float floats the table of contents so that it’s always accessible when you’re scrolling through the document, making it easy to navigate between sections.
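As a rough sketch, the same three options can also be passed from the R console instead of the YAML header, since html_document() in the rmarkdown package takes them as arguments. The helper name render_report and the file name analysis.Rmd are made up for illustration.

```r
# Hypothetical helper: render one .Rmd file with the options from the header above.
render_report <- function(input) {
  report_format <- rmarkdown::html_document(
    code_folding = "hide",  # keep code chunks, collapsed by default
    toc = TRUE,             # add a table of contents
    toc_float = TRUE        # keep the table of contents visible while scrolling
  )
  rmarkdown::render(input, output_format = report_format)
}

# Usage, assuming a file named analysis.Rmd exists in the working directory:
# render_report("analysis.Rmd")
```

I still keep these settings in the YAML header in practice, because then the document carries its own formatting with it.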
R Markdown also allows you to set a variety of options for each code chunk. For example, something I do pretty frequently is to set the options fig.show = "hold", out.width = "50%". The first holds all of a chunk’s plots until the end of the chunk, and the second makes each plot take up half the page width. Displaying plots side by side isn’t usually straightforward in R, but together these options do it for the plots generated by a code chunk. [2]
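Here is a minimal sketch of a chunk that uses those two options. The chunk label two-plots and the choice of the built-in mtcars data are just for illustration, and the chunk header is written as a comment so that the block stays plain R.

```r
# Chunk header in the .Rmd file (shown as a comment):
#   {r two-plots, fig.show = "hold", out.width = "50%"}

# Body of the chunk: two base-graphics plots from the built-in mtcars data.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Weight vs. MPG")
hist(mtcars$mpg,
     xlab = "Miles per gallon",
     main = "Distribution of MPG")
```

In the knitted HTML, the two figures should appear next to each other after the chunk’s code, each at half the page width.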
On the other hand, a common code chunk option that I actively avoid is cache = TRUE. This option saves (or caches) the results of a code chunk and doesn’t rerun the code chunk again unless you change something in that code chunk. In theory, this is good for when you have something that takes a long time to run, such as fitting a model. But in practice, I’ve found that this easily leads to unintentional reproducibility issues. For example, I might change the input data (or other things upstream), but the cached code chunk won’t rerun the model, resulting in an output model that still corresponds to the old data. [3]
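To make the failure mode concrete, here is a small sketch of two chunks, again with the chunk headers written as comments. The labels load-data and fit-model, the mtcars data, and the model formula are all made up for illustration.

```r
# Chunk 1, header: {r load-data}
# Suppose this filter is later edited to cyl == 6.
dat <- mtcars[mtcars$cyl == 4, ]

# Chunk 2, header: {r fit-model, cache = TRUE}
# knitr reruns a cached chunk only when that chunk's own code or options change.
# Editing chunk 1 leaves this chunk's text untouched, so on the next knit the
# cached fit is loaded as is and still reflects the old rows of dat.
fit <- lm(mpg ~ wt, data = dat)
summary(fit)
```

Nothing in the knitted output flags that the model is stale, which is why I’d rather wait for the model to refit.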
There are a lot of options in R Markdown and I still don’t know many of them. But if you use R, I encourage you to spend a bit of time learning how to use R Markdown because it’s not difficult to get started and there’s a good chance it’ll make your life easier. As you get more familiar, you can fine-tune and learn about things that work for you and things that don’t.
[1] I would recommend HTML over the other supported formats because it’s the one that’s most tightly integrated with R Markdown.
[2] However, I literally just discovered today the plot_grid function from the cowplot R package, which may be a better option. It looks easy to use and has the added benefit of automatically labelling your plots while arranging them.
[3] Update (12/15/19): There are ways around this, as Yihui Xie describes here, but I still think caching leads too easily to hidden errors.