Data Leakage Examples in Machine Learning

2021-03-09

statistics

In my research, I often build prediction models: given a set of variables (e.g. patient characteristics), we want to predict an outcome of interest (e.g. disease status). Typically, to guard against overfitting, we use cross-validation: we split the data into separate training and test sets, train the model on the training set, and evaluate its performance on the test set. This sounds simple in theory, but as the scope of your data processing and feature selection steps increases, it becomes easy to accidentally violate the separation between the training and test sets and wind up borrowing information from the test set to train your model.
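As a rough illustration of how a preprocessing step can cross that boundary, here is a minimal Python sketch. It is not taken from the post itself: the feature values are simulated, and the helper names `fit_scaler` and `apply_scaler` are hypothetical. It contrasts standardizing with parameters estimated on all rows (leaky) against parameters estimated on the training rows only.

```python
import random
import statistics


def fit_scaler(values):
    """Estimate the mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.stdev(values)


def apply_scaler(values, mean, sd):
    """Standardize values using previously estimated parameters."""
    return [(v - mean) / sd for v in values]


random.seed(0)
# Simulated single feature (e.g. a lab measurement) for 100 hypothetical patients.
feature = [random.gauss(50, 10) for _ in range(100)]

# Split first, before any preprocessing touches the data.
train, test = feature[:80], feature[80:]

# Leaky: the scaling parameters are estimated from training AND test rows,
# so information about the test set flows into everything fitted afterwards.
leaky_mean, leaky_sd = fit_scaler(feature)

# Leak-free: the parameters come from the training rows only and are then
# reused, unchanged, on the test rows.
train_mean, train_sd = fit_scaler(train)
train_scaled = apply_scaler(train, train_mean, train_sd)
test_scaled = apply_scaler(test, train_mean, train_sd)
```

The same rule applies to any step that learns something from the data, such as imputation or feature selection: fit it on the training set, then apply it to the test set.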

Introduction to SQL (for an R user)

2020-11-28

coding

Although SQL is commonly used in industry, it’s not something that’s often used or taught in academia. I learned it on my own a few years ago, but since I don’t use it regularly, it’s hard to retain. To resolve this, I’ve created the following guide for basic SQL commands along with their equivalents in R/dplyr. Hopefully, this will allow me to pick it back up again more quickly in the future.

Probability of Winning an NBA Game: A Minute-by-Minute Breakdown

2020-11-20

project

Your NBA team is down 17 points and there are only 8 minutes left in the game. What is the probability that they pull off a comeback and win the game? It’s possible to answer this using historical data (i.e. in the past, how often have teams won after being in this situation?). Given that sports commentators love to provide super specific, seemingly arbitrary statistics (e.g. no team has won Game 7 of an ECF after losing Game 6 by more than 10 points), I knew that I should be able to access the relevant data somewhere and calculate these probabilities.
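The historical-data idea above amounts to an empirical proportion: among past games in the same situation, the fraction the trailing team went on to win. Below is a minimal Python sketch of that calculation. The records are made up purely for illustration (they are not real NBA data), and the function name `comeback_probability` and the record layout are assumptions.

```python
def comeback_probability(games, deficit, minutes_left):
    """Share of past games in this situation that the trailing team won.

    Each record is (deficit, minutes_left, trailing_team_won).
    Returns None when no past game matches the situation.
    """
    outcomes = [won for d, m, won in games if d == deficit and m == minutes_left]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


# Made-up records for illustration only, not real game data.
games = [
    (17, 8, False),
    (17, 8, False),
    (17, 8, True),
    (17, 8, False),
    (5, 8, True),
    (5, 8, False),
]

p = comeback_probability(games, deficit=17, minutes_left=8)
```

With real data, rare situations will have few matching games, so the estimate is noisy unless nearby deficits and times are pooled.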

Advice for Prospective PhD Students (in biostatistics)

2020-10-30

advice

It’s application season for graduate admissions again! As a current PhD student, I thought I would share some advice for prospective students. I’d previously written on whether you should get a PhD. In this post, I will talk about things you can do to prepare for a biostatistics PhD and the application process. As with any advice I give on this blog, it is based on my personal experience – I was a math and statistics major in undergrad at UChicago and I’m now a biostatistics PhD student at Johns Hopkins – so your mileage may vary.

What is the Effect of Increasing Voter Turnout in the U.S.?

2020-08-28

project

On campaign trails across the U.S., candidates often repeat the same message: vote! Their goal is to encourage more people to vote, especially the people who are likely to vote for them. But which party benefits more from increasing overall voter turnout? The conventional wisdom today is that it benefits the Democratic party more than the Republican party. This rests on the belief that young people of color have lower voter turnout and are also more likely to vote for Democratic candidates.