Reordering geom_bar and geom_col by Count or Value

One of the things I’m always looking up with ggplot2 is how to reorder the bars in my bar charts by their length (i.e. the count/frequency or value, depending on whether you’re using geom_bar or geom_col). A Google search turns up several solutions, but in this post I will document what I’ve found to be the cleanest and simplest one.

Reordering geom_bar by count

By default, the bars are arranged by the order (levels) of the factor variable.

Analyzing Metadata from Spotify Playlists

I recently came across this post about a Python module, GSA, that allows you to download metadata from Spotify playlists. This was great, because it was a perfect opportunity for me to test RStudio 1.4’s improved support for Python, which I had been excited to try out since I saw the news.

Download Metadata

The original blog post lays out the steps you need to install everything properly, so I won’t repeat them here.

Applying to Internships as a PhD Student

When I was applying to internships last fall, there was a lot I didn’t know about the process. So I thought I would go over my experience in this post, in the hope of demystifying the process a little for future students. Before I begin, some context on my background: I am a fourth-year biostatistics PhD student doing research in cancer and statistical genomics. I mostly applied to data scientist internships at tech/biotech companies and statistics/biostatistics internships at pharmaceutical companies, and this summer I will be doing a data science internship at Amazon.

Data Leakage Examples in Machine Learning

In my research, something I do fairly often is build prediction models: given a set of variables (e.g. patient characteristics), we want to predict an outcome of interest (e.g. disease status). Typically, to guard against overfitting, we use cross-validation: we split the data into separate training and test sets, train the model on the training set, and evaluate its performance on the test set. This sounds like a simple practice to follow in theory, but as the scope of your data processing and feature selection steps grows, it becomes easy to accidentally violate the separation between the training and test sets, and you may wind up borrowing information from the test set to train your model.
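As a minimal sketch of the kind of leakage described above, here is a single preprocessing step (standardizing one feature) done the leaky way and the leak-free way. It uses a single hold-out split rather than full cross-validation, and the data and the helper names (split_train_test, fit_scaler, apply_scaler) are hypothetical, made up for illustration rather than taken from the post.

```python
import random
import statistics


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    """Estimate centering/scaling parameters from the given values only."""
    return statistics.mean(values), statistics.stdev(values)


def apply_scaler(values, params):
    """Standardize values using previously estimated parameters."""
    mean, sd = params
    return [(v - mean) / sd for v in values]


# Hypothetical single feature, e.g. one patient measurement (made-up data).
values = [float(v) for v in range(1, 21)]
train, test = split_train_test(values)

# Leaky: the parameters are estimated from all rows, including the test set,
# so information about the test set flows into the preprocessing.
leaky_params = fit_scaler(train + test)

# Leak-free: the parameters are estimated from the training rows only,
# then reused unchanged on the test rows.
clean_params = fit_scaler(train)
train_scaled = apply_scaler(train, clean_params)
test_scaled = apply_scaler(test, clean_params)
```

The same idea applies to any step that learns something from the data (imputation, feature selection, and so on): fit it on the training rows, then apply the fitted step to the test rows.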

Introduction to SQL (for an R user)

Although SQL is commonly used in industry, it’s not something that’s often used or taught in academia. I learned it on my own a few years ago, but since I don’t use it regularly, it’s hard to retain. To resolve this, I’ve created the following guide for basic SQL commands along with their equivalents in R/dplyr. Hopefully, this will allow me to pick it back up again more quickly in the future.