When I was applying to internships last fall, there was a lot I didn’t know about the process, so in this post I will go over my experience in the hope of demystifying it a little for future students. Before I begin, some context on my background: I am a fourth-year biostatistics PhD student doing research in cancer and statistical genomics. I mostly applied to data scientist internships at tech and biotech companies and to statistics/biostatistics internships at pharmaceutical companies. This summer, I will be doing a data science internship at Amazon.
In my research, I fairly often build prediction models: given a set of variables (e.g. patient characteristics), we want to predict an outcome of interest (e.g. disease status). Typically, to guard against overfitting, we use cross-validation: we keep separate training and test sets, train the model on the training set, and evaluate its performance on the test set. This sounds simple in theory, but as the scope of your data processing and feature selection steps grows, it becomes easy to accidentally violate the separation between the two sets and wind up borrowing information from the test set to train your model.
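To make the leakage risk concrete, here is a minimal sketch of one common preprocessing step, feature standardization, done with and without borrowing from the test set. The toy data, the 80/20 split, and the helper names `fit_scaler` and `apply_scaler` are all assumptions made for illustration, not the code from my research.

```python
import random
import statistics


def fit_scaler(rows):
    """Learn per-feature mean and standard deviation from the given rows only."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows using statistics that were learned elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


random.seed(0)
# Hypothetical toy data: 100 "patients" with 3 numeric characteristics each.
data = [
    [random.gauss(50, 10), random.gauss(120, 15), random.gauss(25, 4)]
    for _ in range(100)
]

random.shuffle(data)
train, test = data[:80], data[80:]

# Leak-free: the statistics come from the training rows only...
means, stds = fit_scaler(train)
train_scaled = apply_scaler(train, means, stds)
# ...and are reused unchanged on the test rows.
test_scaled = apply_scaler(test, means, stds)

# Leaky: fitting on all rows lets test-set information shape the preprocessing.
leaky_means, leaky_stds = fit_scaler(data)
```

The same rule applies to any step that learns something from the data, such as imputation or feature selection: fit it on the training rows, then apply it to the test rows without refitting.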
Although SQL is commonly used in industry, it’s not something that’s often used or taught in academia. I learned it on my own a few years ago, but since I don’t use it regularly, it’s hard to retain. To resolve this, I’ve created the following guide for basic SQL commands along with their equivalents in R/dplyr. Hopefully, this will allow me to pick it back up again more quickly in the future.
Your NBA team is down 17 points and there are only 8 minutes left in the game. What is the probability that they pull off a comeback and win? It’s possible to answer this using historical data (i.e. of the teams that have been in this situation in the past, what fraction went on to win). Given that sports commentators love to provide super specific, seemingly arbitrary statistics (e.g. no team has won Game 7 of an Eastern Conference Finals after losing Game 6 by more than 10 points), I knew that I should be able to access the relevant data somewhere and calculate these probabilities.
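The historical-data idea boils down to a simple proportion: wins divided by the number of times a team was in that situation. Below is a rough sketch of that calculation. The `GameSnapshot` record, the `comeback_probability` function, and the handful of records are hypothetical and made up for illustration; they are not real NBA results.

```python
from dataclasses import dataclass


@dataclass
class GameSnapshot:
    deficit: int         # points the team trailed by at the snapshot
    minutes_left: float  # game time remaining at the snapshot
    won: bool            # whether the trailing team went on to win


def comeback_probability(snapshots, min_deficit, max_minutes_left):
    """Share of matching situations in which the trailing team won.

    Returns None when no historical situation matches.
    """
    matches = [
        s for s in snapshots
        if s.deficit >= min_deficit and s.minutes_left <= max_minutes_left
    ]
    if not matches:
        return None
    return sum(s.won for s in matches) / len(matches)


# Made-up records purely for illustration; not real NBA results.
history = [
    GameSnapshot(17, 8.0, False),
    GameSnapshot(18, 7.5, False),
    GameSnapshot(17, 8.0, True),
    GameSnapshot(20, 6.0, False),
    GameSnapshot(5, 8.0, True),
]

print(comeback_probability(history, min_deficit=17, max_minutes_left=8.0))  # 0.25
```

Note that this sketch counts every situation at least as bad as the one asked about (a deficit of 17 or more with 8 minutes or fewer remaining). Matching only the exact deficit and time would be an equally reasonable choice, at the cost of far fewer historical games to draw on.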
It’s application season for graduate admissions again! As a current PhD student, I thought I would share some advice for prospective students. I’d previously written on whether you should get a PhD. In this post, I will talk about things you can do to prepare for a biostatistics PhD and the application process. As with any advice I give on this blog, it is based on my personal experience – I was a math and statistics major in undergrad at UChicago and I’m now a biostatistics PhD student at Johns Hopkins – so your mileage may vary.