Analyzing Metadata from Spotify Playlists

2022-01-03

project

I recently came across this post about GSA, a Python module that allows you to download metadata from Spotify playlists. This was great, because it was a perfect opportunity for me to test RStudio 1.4’s improved support for Python, which I had been excited to try out since I saw the news.

Download Metadata

The original blog post lays out the steps you need to install everything properly, so I won’t repeat them here.

Applying to Internships as a PhD Student

2021-05-31

advice

When I was applying to internships last fall, there was a lot I didn’t know about the process. So I thought I would go over my experience in this post, in hopes that I can demystify the process a little for future students. Before I begin, some context on my background: I am a fourth-year biostatistics PhD student doing research in cancer and statistical genomics. I mostly applied to data scientist internships at tech and biotech companies and to statistics and biostatistics internships at pharmaceutical companies, and this summer I will be doing a data science internship at Amazon.

Data Leakage Examples in Machine Learning

2021-03-09

statistics

In my research, something I do fairly often is build prediction models: given a set of variables (e.g. patient characteristics), we want to predict an outcome of interest (e.g. disease status). Typically, to guard against overfitting, we use cross-validation: we split the data into separate training and test sets, train the model on the training set, and evaluate its performance on the test set. This sounds simple in theory, but as the scope of your data processing and feature selection steps increases, it becomes easy to accidentally violate the separation between the training and test sets, and you may wind up borrowing information from the test set to train your model.
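As a rough illustration of how the separation can be broken by a preprocessing step, here is a minimal sketch in Python using only the standard library. The helper names (`split_train_test`, `fit_scaler`, `apply_scaler`) and the toy values are made up for this example and are not taken from the post. The idea is that any quantity estimated from the data, here a mean and standard deviation used for scaling, should be estimated from the training set only and then reused on the test set.

```python
import random
import statistics


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    """Estimate scaling parameters (mean, standard deviation) from values."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return mean, sd


def apply_scaler(values, scaler):
    """Standardize values with previously estimated parameters."""
    mean, sd = scaler
    return [(v - mean) / sd for v in values]


# Toy feature values, invented for illustration only.
values = [float(v) for v in range(1, 21)]
train, test = split_train_test(values)

# Leaky: the scaler sees every observation, including the test set.
leaky_scaler = fit_scaler(values)

# Clean: the scaler is fit on the training set only, then reused as-is.
clean_scaler = fit_scaler(train)
train_scaled = apply_scaler(train, clean_scaler)
test_scaled = apply_scaler(test, clean_scaler)
```

The same pattern applies to feature selection or imputation: fit the step on the training data, then apply the fitted step to the test data without refitting.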

Introduction to SQL (for an R user)

2020-11-28

coding

Although SQL is commonly used in industry, it’s not something that’s often used or taught in academia. I learned it on my own a few years ago, but since I don’t use it regularly, it’s hard to retain. To resolve this, I’ve created the following guide for basic SQL commands along with their equivalents in R/dplyr. Hopefully, this will allow me to pick it back up again more quickly in the future.

Probability of Winning an NBA Game: A Minute-by-Minute Breakdown

2020-11-20

project

Your NBA team is down 17 points and there are only 8 minutes left in the game. What is the probability that they pull off a comeback and win the game? It’s possible to answer this using historical data (i.e. in the past, what fraction of teams have won after being in this situation). Given that sports commentators love to provide super specific, seemingly arbitrary statistics (e.g. no team has won Game 7 of an ECF after losing Game 6 by more than 10 points), I knew that I should be able to access the relevant data somewhere and calculate these probabilities.
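The historical-data approach boils down to a simple proportion: among past games where a team faced the same deficit with the same time remaining, count how many that team went on to win. The sketch below is one way to write that in Python. The record layout (`deficit`, `minutes_left`, `won`) and the function name are assumptions made for illustration, and the handful of records are invented toy data, not real NBA results.

```python
def empirical_win_probability(games, deficit, minutes_left):
    """Fraction of matching historical situations that ended in a win.

    Returns None when no past game matches the situation.
    """
    matches = [
        g for g in games
        if g["deficit"] == deficit and g["minutes_left"] == minutes_left
    ]
    if not matches:
        return None
    # True counts as 1 and False as 0, so the sum is the number of wins.
    return sum(g["won"] for g in matches) / len(matches)


# Invented toy records: one row per (game, team) snapshot.
toy_games = [
    {"deficit": 17, "minutes_left": 8, "won": False},
    {"deficit": 17, "minutes_left": 8, "won": False},
    {"deficit": 17, "minutes_left": 8, "won": True},
    {"deficit": 17, "minutes_left": 8, "won": False},
    {"deficit": 5, "minutes_left": 8, "won": True},
]

p_comeback = empirical_win_probability(toy_games, deficit=17, minutes_left=8)
p_unseen = empirical_win_probability(toy_games, deficit=30, minutes_left=2)
```

With real play-by-play data, rare situations will have very few matching games, so grouping nearby deficits or smoothing the estimates may be worth considering.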