“Debugging” Your Analysis as a Data Scientist
Recently, I’ve been working on revisions for a paper[^1] from my PhD and it reminded me of a valuable lesson I learned while working on some tricky statistical analysis for that project.
The idea is what I think of as “debugging” your analysis.[^2] Debugging in a programming context refers to the process of identifying and fixing bugs (errors in your code).
But a similar concept of “debugging” also applies to data science. When there is something that looks off in your analysis or results, you want to be able to systematically identify, track down, and test the possible causes. It could be due to a programming bug in your code, but it could also be a data quality issue, a misunderstanding about what the data is representing, or a violation of a statistical assumption in your methods. Or, as may occasionally happen in research, maybe it’s not a “bug” at all, and maybe the field just has a misconception about what you should expect to see. But before you come to this latter (rarer) conclusion, you’ll want to rule out any likely sources of mistakes.
A schema for this process can be broken down into three major steps.
Step-by-step “debugging” process
1. Evaluate your results
Ask yourself: is this what I expect to see? To answer this question well, you’ll likely need some knowledge of the relevant domain. If you’re new to the field, talk to someone who has more expertise. During my PhD, I was working on research questions in statistical genomics/oncology, and my mentors and collaborators, who have worked on many related questions and datasets, offered me a lot of guidance on what results are likely and what the common pitfalls are.
Likewise, at my current job at Amazon, I’m working in a completely different field (supply chain), but the idea is the same – my colleagues have an understanding of Amazon’s business, so they often already have some intuition for what results are reasonable or not. This kind of intuition and knowledge tends to build up with experience, as you work on more projects and encounter more facets of the domain.
2. Identify the most likely sources of error
If you’ve found that something may be off in your analysis, you’ll want to brainstorm the most likely sources of error and create a rough list of top candidates. Sometimes someone else can identify the cause for you, if they’ve encountered it before. Otherwise, I would start with whichever aspects of your analysis are unfamiliar to you and therefore more prone to mistakes. Are you working with a new dataset? Using a new method? Running the code for the first time?
As an example, a quirk of working with gene expression datasets is that they’re usually stored in a matrix where the columns are the cells (samples) and the rows are the genes (features). This is the reverse of how most datasets are stored, where the rows are the samples/observations and the columns are the variables/features, so someone new to the field may get tripped up by this.
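This orientation mismatch is easy to demonstrate. The sketch below uses a tiny made-up expression matrix (the gene and cell names are placeholders, not from any real dataset): genes on the rows, cells on the columns, which you would typically transpose before handing to a standard modeling library.

```python
import numpy as np
import pandas as pd

# Hypothetical toy expression matrix, stored the genomics way:
# rows are genes (features), columns are cells (samples).
expr = pd.DataFrame(
    np.random.default_rng(0).poisson(5, size=(3, 4)),
    index=["GeneA", "GeneB", "GeneC"],             # genes on the rows
    columns=["Cell1", "Cell2", "Cell3", "Cell4"],  # cells on the columns
)

# Most modeling libraries expect the opposite convention
# (samples as rows, features as columns), so transpose first.
X = expr.T
print(X.shape)  # 4 samples x 3 features -> (4, 3)
```

Forgetting this transpose doesn’t always raise an error, which is exactly what makes it a good candidate to check early.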
Another simple example: the `LogisticRegression` function in Python’s `sklearn` applies L2 regularization by default, which you might not have guessed if you didn’t look up the documentation.[^3] But once you’ve encountered this issue, you’ll know about it and you can help others who run into the same thing.
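You can confirm this default directly on the estimator object. A minimal check (the `penalty=None` option for an unregularized fit is available in recent scikit-learn versions):

```python
from sklearn.linear_model import LogisticRegression

# scikit-learn's LogisticRegression regularizes by default:
# penalty="l2" with strength C=1.0, shrinking coefficients toward zero.
default_model = LogisticRegression()
print(default_model.penalty)  # "l2"

# For a plain, unregularized logistic regression (recent sklearn versions),
# disable the penalty explicitly:
plain_model = LogisticRegression(penalty=None)
```

If you were comparing coefficients against output from another package (say, a classical GLM fit), this silent shrinkage is a likely source of the discrepancy.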
3. Confirm or rule out each candidate source
The next step is to test out each possible source. Verify whether your intended analysis actually matches up with the analysis you did. You will want to isolate and home in on each candidate as cleanly as possible, so you can eliminate them systematically until you find the actual problem.
However, the most challenging “debugging” issues are often the ones where you don’t really know the specific candidate causes. A useful next step in this case is to figure out which general direction you want to investigate further. For example, are the unexpected results of your statistical test because (a) something weird is going on with the empirical data (often a problem in genomics, due to noise in the data collection), or (b) something weird is going on with the methods (also often a problem in genomics, due to properties like sparsity and high-dimensionality)? One way to figure this out is to run the same analysis on simulated data, where you know and can control what the results are supposed to be.[^4] If you get what you expect with the simulated data, then you may start looking more closely at (a), but if you don’t get what you expect, then it’s very likely (b).
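Here is a minimal sketch of that idea, using a two-sample t-test as a stand-in for whatever test your real pipeline runs (the effect size and sample sizes are arbitrary choices for illustration): simulate data where you control the ground truth, then check that the pipeline recovers it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate data where the ground truth is known and under our control:
# group B's mean is genuinely shifted by 1.0 relative to group A.
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.0, size=200)

# Run the same test you would run on the real data.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# With a true effect this large and n=200 per group, the test should
# easily reject the null. If it does, the method works and you should
# look at the data (a); if it doesn't, suspect the method/pipeline (b).
print(p_value < 0.05)
```

The point is not this particular test but the pattern: a simulation with a known answer turns a vague “something is off” into a concrete experiment that splits the candidate causes into two groups.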
I hope this framework was a little bit helpful for thinking about how to “debug” your analysis. The actual work is often far messier than following steps 1-2-3 in a linear path like I’ve presented here. You might go between steps 2 and 3 many times in a spiral of confusion, especially when there are many possible sources of doubt. But this is where experience (either yours or that of your colleagues) helps a lot, as you can then home in on the likely issues more quickly.
[^1]: Here is a link to the preprint if you’re interested!
[^2]: Or what one of my mentors described as “peeling the onion.”
[^3]: It’s an absolutely crazy default setting in my opinion.
[^4]: This is in fact what we did as part of the paper that inspired this post, though it took me many months of trial and error and probably hundreds of plots!