Bayesian statistics for confused data scientists

An intuitive guide to understanding the fundamental differences between Bayesian and Frequentist statistics, breaking down complex concepts like priors and posteriors for data science practitioners.
It’s the third time I’ve fallen into the Bayesian rabbit hole. It always goes like this: I find some cool article about it, it feels like magic, whoever is writing about it is probably a little smug about how much cooler than frequentism it is (and I don’t blame them), and yet I still leave confused about what exactly is happening. This post is a cathartic attempt to force myself into making sense out of everything I’ve read so far, and hopefully it will also be useful to the legions out there who surely feel the same way as I do.[1]
Bayesian vs. frequentist statistics: the story of a feud
The frequentist approach is so dominant that when you learn statistics, it’s not named as such, it just is statistics. The Bayesian approach, on the other hand, is this weird niche that only a few people seem reeeeally into. It’s the Haskell of statistics. And just like its programming counterpart, this little tribe of Bayesians is actually right to love it so much.
At its heart, the difference between Bayesian and frequentist statistics is about the philosophical role that probability plays in the framework. In both frameworks, you have parameters (usually some unknown quantities which determine how things behave) and you have data (or observations), which are things you’ve measured.
A simple example would be if you roll a die a bunch of times. The parameter here is the number of faces $n$ (intuitively, we all know the more faces, the less likely a given face will appear), while the data is just the collected faces you see as you roll the die. Let me tell you right now that for my example to make any sense whatsoever, you have to make the scenario a bit more convoluted. So let’s say you’re playing DnD or some dice-based game, but your game master is rolling the die behind a curtain. So you don’t know how many faces the die has (maybe the game master is lying to you, maybe not); all you know is that it’s a die, and the values that are rolled. A frequentist in this situation would tell you the parameter is fixed (although unknown), and the data is just randomly drawn from the uniform distribution $\mathcal{U}\{1, n\}$. A Bayesian, on the other hand, would say that the parameter is itself a random variable drawn from some other distribution $P(n)$, with its own uncertainty, and that the data tells you what that distribution truly is.
I’m going to pause here for you to take a breath and yell at your screen that it makes no sense. Of course the number of faces is fixed, it’s a die! What Bayesian statistics quantifies with the distribution $P(n)$ is not how random the number of faces is, but how uncertain you are about it. This is the crucial difference and the whole reason why Bayesian statistics is so powerful. In frequentist approaches, uncertainty is often an afterthought, something you just tack on using some sample-to-population formula after the fact. Maybe if you feel fancy you use some bootstrapping method. And whatever interval you get from this is a confidence interval: it doesn’t tell you how likely the parameter is to be within it, but how often intervals constructed this way will contain the parameter. This is often a confusing point which makes confidence intervals a very misunderstood concept. In Bayesian statistics, on the other hand, the parameter is not a point but a distribution. The spread of that distribution already accounts for the uncertainty you have about the parameter, and the credible interval you get from it actually tells you how likely the parameter is to be within it.
On a more mathematical note, the difference between the two approaches lies within Bayes’ famous theorem, which tells you how conditional probabilities relate to each other:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

That’s it! If you take this equation and you stick in it the parameters $\theta$ and the data $D$, you get

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)},$$

which is the cornerstone of Bayesian inference. This may not seem immediately useful, but it truly is. Remember that $D$ is just a bunch of observations, while $\theta$ is what parametrizes your model. So $P(D \mid \theta)$, the likelihood, is just how likely it is to see the data you have for a given realization of the parameters. Meanwhile, $P(\theta)$, the prior, is some intuition you have about what the parameters should look like. I will get back to this, but it’s usually something you choose. Finally, you can just think of $P(D)$ as a normalization constant, and one of the main things people do in Bayesian inference is literally whatever they can so they don’t have to compute it! The goal is of course to estimate the posterior distribution $P(\theta \mid D)$, which tells you what distribution the parameter takes. The posterior distribution is useful because
- it gives you a clear idea of your uncertainty on the model parametrization,
- you can use it to build the posterior predictive distribution $P(\tilde{D} \mid D) = \int P(\tilde{D} \mid \theta)\, P(\theta \mid D)\, d\theta$, where $\tilde{D}$ is new data.
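To make the pieces of Bayes’ theorem a bit more concrete, here is a minimal sketch for a discrete parameter. It assumes a toy setup that is not the example used in this post: a fair die with 4, 6 or 8 faces (equal prior weight on each) and three observed rolls. The function names (`posterior`, `die_likelihood`, `posterior_predictive`) are hypothetical, and for a discrete parameter the integral in the posterior predictive becomes a sum.

```python
from fractions import Fraction


def die_likelihood(rolls, n_faces):
    """P(rolls | n_faces) for independent rolls of a fair die."""
    if max(rolls) > n_faces:
        return Fraction(0)  # impossible: a face larger than the die
    return Fraction(1, n_faces) ** len(rolls)


def posterior(prior, likelihood, data):
    """P(theta | data) for a discrete parameter, via Bayes' theorem."""
    unnormalized = {theta: likelihood(data, theta) * p for theta, p in prior.items()}
    evidence = sum(unnormalized.values())  # P(D), the normalization constant
    return {theta: w / evidence for theta, w in unnormalized.items()}


def posterior_predictive(post, value):
    """P(next roll == value | data) = sum over theta of P(value | theta) P(theta | data)."""
    return sum(die_likelihood([value], theta) * p for theta, p in post.items())


# Assumed toy prior: 4, 6 or 8 faces, all equally plausible.
prior = {4: Fraction(1, 3), 6: Fraction(1, 3), 8: Fraction(1, 3)}
post = posterior(prior, die_likelihood, [3, 1, 4])

print(post)                           # posterior weight of each candidate die
print(posterior_predictive(post, 5))  # chance that the next roll is a 5
```

With these assumed numbers, the 4-faced die ends up with about 70% of the posterior mass (exactly 216/307), even though the prior treated all three dice equally.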
Let’s get back to our little die rolling example and say you observe the following values with the given frequencies:
| Value | Count | Frequency |
|---|---|---|
| 1 | 2 | 0.250 |
| 2 | 1 | 0.125 |
| 3 | 2 | 0.250 |
| 4 | 3 | 0.375 |
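If you were a frequentist, you would look for the maximum likelihood estimate of the number of faces, which is essentially the maximization of the likelihood term $P(D \mid \theta)$ introduced above. Let’s take a second to go through this: if your die has $n$ faces, then each roll $x$ satisfies $P(x \mid n) = 1/n$ for $x \le n$ (and 0 otherwise), and the probability to observe exactly this data is

$$P(D \mid n) = \prod_{i=1}^{8} P(x_i \mid n) = \frac{1}{n^8} \quad \text{for } n \ge 4.$$

This is clearly maximal when $n$ is the smallest value possible, which here is 4 (since it’s not possible to draw a 4 with a 3-faced die). So far this is quite easy, but the confidence interval is another affair, and illustrates quite well the idea of “add-on”. One way to find it is to find all the values of $n$ for which $P(\max_i x_i \le 4 \mid n) \ge \alpha$, where $\alpha$ is the significance level (usually chosen to be 5%, which corresponds to a 95% confidence level). For a given $n$, this probability is equal to $(4/n)^8$, which yields a CI of the form $[4,\, 4\,\alpha^{-1/8}]$ (roughly $[4, 5.8]$ for $\alpha = 0.05$), so there we have it![2]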
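The frequentist recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the search range of 1 to 100 faces, the helper names (`likelihood`, `upper_bound`, `coverage`) and the simulated "true" die with 6 faces are all mine, not part of the original analysis. The simulation at the end is there to show what a confidence interval actually promises: over many repeated experiments, intervals built this way contain the true number of faces at least 95% of the time.

```python
import random

rolls = [1, 1, 2, 3, 3, 4, 4, 4]  # the eight observations from the table
m = len(rolls)
largest = max(rolls)


def likelihood(n):
    """P(D | n): zero if the die is too small, n^-m otherwise."""
    return 0.0 if n < largest else float(n) ** (-m)


# Maximum likelihood estimate over an assumed search range of 1..100 faces.
n_mle = max(range(1, 101), key=likelihood)


def upper_bound(biggest, n_obs, alpha=0.05):
    """Largest n for which (biggest / n)^n_obs is still >= alpha."""
    return biggest * alpha ** (-1.0 / n_obs)


ci = (largest, upper_bound(largest, m))


def coverage(true_n, n_obs, trials=20000, seed=0):
    """Fraction of simulated experiments whose interval contains true_n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        biggest = max(rng.randint(1, true_n) for _ in range(n_obs))
        if biggest <= true_n <= upper_bound(biggest, n_obs):
            hits += 1
    return hits / trials


cov = coverage(true_n=6, n_obs=m)
print(n_mle, ci, cov)
```

Note that the coverage statement is about the procedure, not about any single interval, which is exactly the subtlety that makes confidence intervals easy to misread.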
Now let’s put on a Bayesian cap and see what we can do. First of all, we already saw that with $m$ observations, $P(D \mid n) = n^{-m}$ for $n \ge 4$ ($m = 8$ here), so we’re set with the likelihood. The prior, as I mentioned before, is something you choose. You basically have to decide on some distribution you think the parameter is likely to obey. But hear me: it doesn’t have to be perfect as long as it’s reasonable! What the prior does is basically give some initial information, like a boost, to your Bayesian modeling. The only thing you should make sure of is to give support to any value you think might be relevant (so always choose a relatively wide distribution). Here for example, I’m going to choose a super uninformative prior: the uniform distribution with $P(n) = 1/N$ for $1 \le n \le N$, for some very large $N$ (say 100). Then using Bayes’ theorem, the posterior distribution is $P(n \mid D) \propto P(D \mid n)\, P(n) \propto n^{-8}$ for $4 \le n \le N$. The symbol $\propto$ means it’s true up to a normalization constant, so we can rewrite the whole distribution as

$$P(n \mid D) = \frac{n^{-8}}{\sum_{k=4}^{N} k^{-8}} \approx \frac{n^{-8}}{\zeta(8, 4)},$$

where the denominator $\zeta(8, 4) = \sum_{k=0}^{\infty} (k + 4)^{-8}$ is called the Hurwitz zeta function, a fast-converging series (the terms beyond $N = 100$ are negligible, which is why the finite sum can be replaced by it). At this stage, the Bayesian statistician would compute the maximum a posteriori (MAP) estimate, given by the maximum of the distribution (which is at $n = 4$), or the posterior mean $\mathbb{E}[n \mid D] \approx 4.26$. A credible interval can be obtained now by just looking at the cumulative distribution function $F(n)$ for the posterior distribution and finding the values for which it covers 95% of the probability mass. For this problem we can just do it for a few values and see where it stops, leading to the interval [4,5]:
| n | F(n) |
|---|---|
| 4 | 0.816 |
| 5 | 0.953 |
| 6 | 0.985 |
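The table above can be reproduced without any special function by normalizing the posterior numerically. The sketch below assumes the uniform prior on 1 to $N = 100$ faces described in the text; the variable names are mine. It computes the posterior on a grid, then the MAP, the posterior mean and the cumulative distribution.

```python
rolls = [1, 1, 2, 3, 3, 4, 4, 4]  # the eight observations from the table
m = len(rolls)
largest = max(rolls)
N = 100  # upper limit of the uniform prior

prior = {n: 1.0 / N for n in range(1, N + 1)}
likelihood = {n: (float(n) ** (-m) if n >= largest else 0.0) for n in prior}

# Bayes' theorem: posterior is likelihood times prior, then normalized.
unnormalized = {n: likelihood[n] * prior[n] for n in prior}
evidence = sum(unnormalized.values())
posterior = {n: w / evidence for n, w in unnormalized.items()}

n_map = max(posterior, key=posterior.get)             # maximum a posteriori
post_mean = sum(n * p for n, p in posterior.items())  # posterior mean

# Cumulative distribution function F(n) of the posterior.
cdf = {}
running = 0.0
for n in sorted(posterior):
    running += posterior[n]
    cdf[n] = running

# Smallest n at which at least 95% of the posterior mass is covered.
upper = next(n for n in sorted(cdf) if cdf[n] >= 0.95)
credible_interval = (largest, upper)

for n in (4, 5, 6):
    print(n, round(cdf[n], 3))
print(n_map, round(post_mean, 2), credible_interval)
```

Working on a grid like this only works because the parameter is a single bounded integer; it is the brute-force version of the normalization that people usually try to avoid computing.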
So this result agrees quite well with the frequentist approach, and the uncertainty here can be interpreted as a consequence of
- using a very uninformative prior $P(n)$,
- having little data to observe.
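To get a feeling for the second point, here is a small sketch of how the posterior sharpens as data accumulates. It relies on a hypothetical scenario that is not in the original data: the largest value seen stays at 4 no matter how many rolls you observe. The prior is the same uniform prior on 1 to 100 faces, and the function name `posterior_mass_on_max` is made up for this illustration.

```python
def posterior_mass_on_max(n_obs, biggest=4, N=100):
    """Posterior probability that n == biggest, under a uniform prior on 1..N."""
    weights = [float(n) ** (-n_obs) for n in range(biggest, N + 1)]
    return weights[0] / sum(weights)


# Hypothetical sample sizes, all with a largest observed value of 4.
masses = {n_obs: posterior_mass_on_max(n_obs) for n_obs in (2, 8, 16, 32)}
for n_obs, mass in masses.items():
    print(n_obs, round(mass, 3))
```

With only 2 rolls the posterior is spread over many candidate dice, with 8 rolls you recover the 0.816 of the table, and with 16 rolls the 95% credible interval already collapses onto 4 alone.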
If both the likelihood and the prior carry little information, then the posterior will be very uncertain. This is a perfect example where we can see how using a different prior, one which includes some knowledge about the problem, can help. Since $n$ is an integer which is likely close to 4, I will use a geometric distribution as the prior $P(n)$, with parameter $p$. In the piece of code below, I use pymc to
Source: Hacker News










