Tips for a First-Time Teaching Assistant

2022-01-17

advice

For the last 3 years, I’ve been teaching the lab sections as a teaching assistant for the statistics sequence at the Johns Hopkins Bloomberg School of Public Health. The course is geared towards master’s-level students in public health, though some PhD students and doctors or other members of the JHU Medicine community also take it. As you can imagine, students in the classroom can vary widely in how familiar they are with stats, math, and coding.

Reordering geom_bar and geom_col by Count or Value

2022-01-06

coding

One of the things I’m always looking up with ggplot2 is how to reorder the bars in my bar charts by their length (i.e. the count/frequency or value, depending on whether you’re using geom_bar or geom_col). If you do a Google search, there are multiple different solutions, but I will document in this post what I’ve found to be the cleanest and simplest solution.

Reordering geom_bar by count

By default, the bars are arranged by the order (levels) of the factor variable.

Analyzing Metadata from Spotify Playlists

2022-01-03

project

I recently came across this post about a Python module GSA that allows you to download metadata from Spotify playlists. This was great, because it was a perfect opportunity for me to test RStudio 1.4’s improved support for Python, which I had been excited to try out since I saw the news.

Download Metadata

The original blog post lays out the steps you need to install everything properly, so I won’t repeat them here.

Applying to Internships as a PhD Student

2021-05-31

advice

When I was applying to internships last fall, there was a lot I didn’t know about the process. So I thought I would go over my experience in this post, in hopes that I can demystify the process a little for future students. Before I begin, some context on my background: I am a 4th-year biostatistics PhD student doing research in cancer and statistical genomics. I mostly applied to data scientist internships at tech/biotech companies and statistics/biostatistics internships at pharmaceutical companies, and this summer, I will be doing an internship in data science at Amazon.

Data Leakage Examples in Machine Learning

2021-03-09

statistics

In my research, something I do fairly often is build prediction models: given a set of variables (e.g. patient characteristics), we want to predict an outcome of interest (e.g. disease status). Typically, to prevent overfitting, we do cross-validation: we split the data into separate training and test sets, train the model on the training set, and evaluate its performance on the test set. This sounds like a simple practice to follow in theory, but as the scope of your data processing and feature selection steps increases, it becomes easy to accidentally violate the separation between the training and test sets, and you may wind up borrowing information from the test set to train your model.
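To make that kind of slip concrete, here is a minimal sketch in Python. It is a toy illustration, not code from the post: the numbers are made up, and the helpers `fit_scaler` and `apply_scaler` are hypothetical names. It uses standardization as the data processing step and contrasts fitting it on all of the data (leaky) with fitting it on the training set only.

```python
from statistics import mean, stdev


def fit_scaler(values):
    """Return the (mean, standard deviation) learned from `values`."""
    return mean(values), stdev(values)


def apply_scaler(values, scaler):
    """Standardize `values` using a previously fitted (mean, sd) pair."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Toy feature values (made up); the test set is deliberately shifted.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: the scaler sees the test values before evaluation,
# so information from the test set flows into the preprocessing.
leaky_scaler = fit_scaler(train + test)

# Clean: the scaler is fitted on the training set only,
# then reused unchanged on the test set.
clean_scaler = fit_scaler(train)
train_scaled = apply_scaler(train, clean_scaler)
test_scaled = apply_scaler(test, clean_scaler)

print("mean used by leaky scaler:", leaky_scaler[0])
print("mean used by clean scaler:", clean_scaler[0])
```

The two scalers end up with different means because the leaky one has absorbed the test values. The same pattern applies to any step that learns something from the data, including feature selection: under cross-validation it would be refitted inside each training fold rather than once on the full dataset.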