As someone who watches basketball and enjoys sports analytics (see my previous post on estimating win probabilities live during an NBA game), I’ve been a fan of FiveThirtyEight’s NBA prediction models, which are always fun to follow and interesting to read about.

## Calibration vs Accuracy

Recently, I came across an article by FiveThirtyEight in which they self-evaluated their prediction models. The primary metric¹ they use to evaluate their model is calibration, that is, whether their forecasted probabilities match up with the actual probabilities. For example, if they predict that a team is going to win with 70% probability, is it true that such teams end up winning 70% of the time in reality? In the article, they show that their models are generally well-calibrated.
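To make the idea concrete, here's a small Python sketch of what a calibration check looks like. This is my own toy simulation, not FiveThirtyEight's data or code: bin the forecasts by predicted win probability, then compare each bin's average forecast to the fraction of games the favored side actually won.

```python
import numpy as np

# Simulate 2000 hypothetical forecasts from a well-calibrated model:
# the favorite's true win chance equals the forecast probability.
rng = np.random.default_rng(0)
probs = rng.uniform(0.5, 1.0, size=2000)   # forecast probabilities for the favorite
outcomes = rng.random(2000) < probs        # simulated game results

# Bin forecasts and compare predicted vs. observed win rates per bin.
bins = np.linspace(0.5, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (probs >= lo) & (probs < hi)
    print(f"forecast {lo:.1f}-{hi:.1f}: "
          f"predicted {probs[mask].mean():.2f}, actual {outcomes[mask].mean():.2f}")
```

For a well-calibrated model the two columns track each other closely in every bin, which is exactly the pattern FiveThirtyEight reports for their forecasts.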

But is calibration really the metric we care about? It seems to me that what's more important to get right is the model's discriminative power. For instance, consider the following two models:

1. A model that predicts a win probability of 50% for all teams
2. A model that correctly predicts a win probability of 100% for all winning teams and a win probability of 0% for all losing teams

Both of these models exhibit perfect calibration, but clearly, the second model is much more useful than the first model. The second model correctly predicts every single game outcome, whereas the first model is equivalent to a random guess of who the winner will be. In other words, we need to consider the accuracy of the prediction – what % of game outcomes can the model correctly predict? – to properly evaluate the model.
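Here's a quick Python illustration of this point, using made-up outcomes: both toy models pass a calibration check, but only the second has any predictive accuracy.

```python
import numpy as np

# 100 hypothetical games with a 50% base rate of wins (toy data).
outcomes = np.array([1, 0] * 50)
model_a = np.full(100, 0.5)          # model 1: predicts 50% for everything
model_b = outcomes.astype(float)     # model 2: oracle, predicts 100% / 0% correctly

# Both are perfectly calibrated: among games forecast at probability p,
# a fraction p actually end in a win.
assert abs(outcomes[model_a == 0.5].mean() - 0.5) < 1e-9
assert outcomes[model_b == 1.0].mean() == 1.0
assert outcomes[model_b == 0.0].mean() == 0.0

# Accuracy (predicting the winner at a 50% threshold) separates them.
acc_a = ((model_a > 0.5) == outcomes).mean()   # a tie at 0.5 counts as "predict loss"
acc_b = ((model_b > 0.5) == outcomes).mean()
print(acc_a, acc_b)   # 0.5 1.0
```

The constant-50% model gets half the games right, no better than a coin flip, while the oracle gets them all, even though a calibration plot cannot tell the two apart.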

Moreover, we will want to look at the relative accuracy of the model, meaning how good the model is relative to other reasonable baseline models we might use. Given the increasing complexity of FiveThirtyEight’s model, we would ideally expect it to be more accurate than a simpler, less sophisticated model.

## My Evaluation Results

Since FiveThirtyEight has generously made their data available on GitHub, I was able to evaluate their NBA prediction models for the playoffs from 2016 to 2020, keeping in mind the points I made above (code available here).

I compared the accuracy of FiveThirtyEight’s predictions to a baseline model that only uses the team seeds to predict the winner of each playoff series.² For example, a team seeded 2nd will be predicted to win the playoff series against a team seeded 7th. The results of the evaluation for both models are summarized in the table³ below for each season.

| Season | Total # of series | 538 model: # correct | 538 model: % correct | Baseline model: # correct | Baseline model: % correct |
|---|---|---|---|---|---|
| 2015-16 | 15 | 12 | 80% | 12 | 80% |
| 2016-17 | 15 | 12 | 80% | 13 | 87% |
| 2017-18 | 15 | 10 | 67% | 10 | 67% |
| 2018-19 | 15 | 13 | 87% | 12 | 80% |
| 2019-20 | 15 | 10 | 67% | 10 | 67% |
| Overall | 75 | 57 | 76% | 57 | 76% |

In the last row of the table, we see that the overall accuracy of FiveThirtyEight’s model is 76%, the same as the overall accuracy of our baseline model! The two models correctly predicted the same number of series in every season except 2016-17, when the baseline did one better, and 2018-19, when FiveThirtyEight did one better. This is pretty remarkable when you remember that the baseline model only uses the team seeds to predict the series outcome.
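For concreteness, the seed-only baseline can be sketched in a few lines of Python. This is an illustration of the rule, not my actual evaluation code (which is linked above and uses FiveThirtyEight's full dataset); the records shown are the Warriors' and Cavaliers' real 2015-16 regular-season win totals.

```python
def predict_winner(team_a, team_b):
    """Pick the better-seeded team; break a seed tie (possible only in the
    Finals, where seeds come from different conferences) by regular-season
    win count."""
    if team_a["seed"] != team_b["seed"]:
        return team_a if team_a["seed"] < team_b["seed"] else team_b
    return team_a if team_a["wins"] > team_b["wins"] else team_b

# The one seed tie in 2016-2020: the 2016 Finals, both teams seeded #1.
warriors = {"name": "Warriors", "seed": 1, "wins": 73}
cavaliers = {"name": "Cavaliers", "seed": 1, "wins": 57}
print(predict_winner(warriors, cavaliers)["name"])   # Warriors
```

The baseline therefore picked the Warriors in 2016 on the W-L tiebreaker, which illustrates how little information the model actually consumes: one integer per team, plus win counts in the rare tie.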

The point of this post isn’t to say that FiveThirtyEight’s model is necessarily “bad.” Rather, I think it’s an example of how hard it can be to improve on a simple baseline model, possibly because more complex models overfit. This has been shown in other fields too – for example, this paper by Zhao et al. (2014), which shows that a simple method does just as well as more complicated methods for predicting patient outcomes from gene expression data.

But it may also have something to do with the stochastic nature of the game, aspects of which may not be possible to measure quantitatively. We know, intuitively, that when two teams are well matched, the game becomes a toss-up.⁴ In other words, there may be a ceiling on how well any statistical model can predict the outcome of a game.⁵ Maybe that’s good news for basketball fans – after all, the sport would become rather boring if we could predict every single outcome beforehand.

1. They also evaluate their model using “Brier skill scores,” which they describe as a modification of Brier scores without specifying it; I’m assuming it reflects model accuracy in some way. But the way they’ve presented it is not very interpretable, because they compare their model’s Brier score to that of a naive model that predicts 50% for everything, which is of course a useless model. So all the higher Brier skill score tells us is that the model does better than a random guess.

2. Since seedings are determined within each conference, the only scenario in which seeds alone are not enough to predict a winner is an NBA Finals between two teams with the same seed from the East and the West. From 2016 to 2020, this happened only once, in the 2016 Finals, when the Warriors and the Cavaliers were both the No. 1 seed in their respective conferences. To break this tie, I used the regular-season W-L record to predict the winner.

3. If you’re interested in how I made the table, check out the R packages kableExtra and formattable.

4. Unsurprisingly, I found that both the FiveThirtyEight model and the baseline model predicted less accurately when the difference between the team seedings was smaller.

5. Whether an accuracy of 76% is really where the ceiling is, however, is a different story.