By Michael Betancourt
Everyone loves statistics by simulation/sampling these days, but samples are not magic and they do not allow you to implement every important probabilistic operation in a straightforward way. A short thread on the common ways to work with probability distributions.
12:00 PM · Aug 26, 2021
There are basically two ways that we can actually implement probability theory in practice, and by that I mean estimate expectation values of various functions with respect to a given probability distribution.
One approach is to compare the given probability distribution to a convenient uniform measure, which yields a probability density function. On continuous spaces integrating the probability density function gives expectation values, and calculus tells us how to integrate things.
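To make that concrete, here's a minimal sketch using SciPy quadrature; the standard normal density and the expectand g(x) = x^2 are my own illustrative choices, nothing canonical:

```python
import numpy as np
from scipy import integrate, stats

# Expectation of g(x) = x^2 under a standard normal, computed by
# numerically integrating the expectand against the density.
g = lambda x: x**2
expectation, _ = integrate.quad(
    lambda x: g(x) * stats.norm.pdf(x), -np.inf, np.inf
)
print(expectation)  # close to 1.0, the exact second moment
```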
The other is to generate sequences of points that are compatible with the given probability distribution in a very formal way. Empirical averages of these sequences, i.e. Monte Carlo, are straightforward to implement and approximate expectation values in a well-defined way.
Empirical averages are much easier to implement than exact or numerical integration. So long as we can generate those compatible sequences, the simulation/sampling/Monte Carlo approach will typically be the most useful in practice.
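The same toy expectation by Monte Carlo, to show just how little machinery the empirical average needs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo estimate of the same expectation: an empirical average
# of g over samples from the distribution.
x = rng.normal(size=100_000)   # samples from the standard normal
print(np.mean(x**2))           # close to 1.0, with O(1/sqrt(N)) error
```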
Expectation values allow us to characterize not only the given probability distribution but also any _pushforward_ of that distribution with respect to some transformation to a new space.
More formally if pi is our given distribution over the space X and f : X -> Y maps from X to a new space Y then the pushforward f_* pi is a distribution over Y. In other words if pi is the distribution of inputs, x, then f_* pi is the corresponding distribution of outputs, f(x).
Now the expectation value of any function g : Y -> R with respect to the pushforward distribution, E_{f_* pi}[g], is equal to the expectation value of the composition g \circ f (apply f first, then g) with respect to our original distribution, E_{pi}[g \circ f].
In other words any expectation value E_{f_* pi}[g] that characterizes some useful property of the pushforward distribution can be computed back on our original space with a suitable change to the expectand. This is how we actually examine summaries and marginals in practice.
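Here's a quick numerical check of the identity E_{f_* pi}[g] = E_{pi}[g \circ f], with my own choices of pi (standard normal), f (the exponential, whose pushforward is the lognormal), and g (the identity):

```python
import numpy as np
from scipy import integrate, stats

# pi = standard normal on X, f(x) = exp(x), so f_* pi is lognormal on Y.
g = lambda y: y  # expectand on the output space

# E_{f_* pi}[g]: integrate against the pushforward density on Y.
lhs, _ = integrate.quad(lambda y: g(y) * stats.lognorm.pdf(y, s=1),
                        0, np.inf)

# E_{pi}[g o f]: integrate against the original density on X.
rhs, _ = integrate.quad(lambda x: g(np.exp(x)) * stats.norm.pdf(x),
                        -np.inf, np.inf)

print(lhs, rhs)  # both close to exp(0.5) ~ 1.6487
```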
This has immediate consequences for our two implementation strategies. E_{f_* pi}[g] = E_{pi}[g \circ f] implies that the pushforward probability density function is given by a really nasty integral over the inverse level sets f^{-1}(y) that I won't even try to write here.
Samples, however, transform much more easily. If (x_1, ..., x_N) are samples from our given distribution pi then (f(x_1), ..., f(x_N)) are samples from the pushforward distribution f_* pi! All we have to do is map each sample to the output space and we get pushforward samples.
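Continuing the toy example above, pushing samples through f really is the whole job:

```python
import numpy as np

rng = np.random.default_rng(2)

# Samples from pi transform directly into samples from f_* pi.
x = rng.normal(size=100_000)   # (x_1, ..., x_N) from the standard normal
y = np.exp(x)                  # (f(x_1), ..., f(x_N)) from the lognormal

print(np.mean(y))  # close to exp(0.5), matching the quadrature above
```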
In other words it's much easier to construct samples from a pushforward distribution than a pushforward probability density function. This is why deterministic transformations are so trivial to incorporate alongside probabilistic transformations in simulations.
So sampling allows us to readily characterize a given distribution and any transformation of that distribution. That means we only ever need to sample, right? We never have to look at probability density functions ever again? Not quite.
See once we introduce a function f : X -> Y there's one other important operation that arises. The pushforward considers how a distribution of inputs maps to a distribution of outputs, but what happens to the inputs when we _fix_ the output?
A _conditional probability distribution_ slices through the given probability distribution on X, quantifying all of the input values compatible with a fixed output value tilde{y} \in Y. This conditioning operation is essentially how we _learn_ in probability theory.
Unfortunately, and unlike pushforward distributions, conditional distributions cannot be characterized by expectations with respect to the original distribution pi in a straightforward way.
The problem is that we have to somehow restrict our distribution, and the domain of any expectation integral, to the set f^{-1}(tilde{y}) \subset X, which can be arbitrarily nasty when Y has fewer dimensions than X.
For example let's say that we wanted to transform samples from the original distribution to samples from some conditional distribution. Only those samples x_n with the right output value, f(x_n) = tilde{y}, will correspond to samples from the conditional distribution.
In other words all we can do is rejection sampling, generating lots and lots of samples and keeping only those with the right output value. If Y is a continuous space, however, then the probability of generating a sample with the exact right output value is...zero.
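A tiny illustration of why exact rejection is hopeless on a continuous space, conditioning a toy bivariate normal on its first coordinate:

```python
import numpy as np

rng = np.random.default_rng(3)

# Condition a bivariate standard normal on its first coordinate,
# f(y, z) = y, by keeping only samples with exactly the right output.
samples = rng.normal(size=(1_000_000, 2))
accepted = samples[samples[:, 0] == 1.0]  # exact equality on a continuum

print(len(accepted))  # 0: no draw hits the fixed value exactly
```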
What about probability density functions? Conveniently restricting a function to a subset is actually really easy! We just have to multiply it by a function that's zero everywhere outside f^{-1}(tilde{y}) and one everywhere inside f^{-1}(tilde{y}).
We call that an indicator function, I[f(x) = tilde{y}]. Consequently the conditional probability density function is just pi(x | tilde{y}) = constant * pi(x) * I[f(x) = tilde{y}]. It really is that easy.
To summarize the typically excessive thread so far, pushforward distributions are much easier to characterize with samples but conditional distributions are much easier to characterize with probability density functions. _Neither is sufficient on its own_.
I've been super general here but much of this will probably be more familiar if we restrict attention to product spaces and projection functions. Before revealing the conclusion let's look at this case for a second.
Let's say that our initial distribution pi is defined over the product space X = Y times Z and our transformation isolates the first component, f : X -> Y. We can represent pi with the probability density function pi(y, z) or the samples ( (y_1, z_1), ..., (y_N, z_N)).
The pushforward distribution f_* pi is the marginal distribution over Y. The marginal density function is given by a nasty integral, f_* pi(y) = \int dz pi(y, z), whereas the pushforward samples are given by just dropping the z_n, f(y_n, z_n) = y_n.
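Here's a sketch of both representations of the marginal for a toy correlated bivariate normal; the correlation 0.7 and the evaluation point are my own choices:

```python
import numpy as np
from scipy import integrate, stats

# A correlated bivariate normal pi(y, z); its y-marginal is standard normal.
rho = 0.7
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

# The marginal density at y = 0.5 via the nasty integral over z...
y0 = 0.5
marginal_pdf, _ = integrate.quad(lambda z: joint.pdf([y0, z]),
                                 -np.inf, np.inf)
print(marginal_pdf, stats.norm.pdf(y0))  # the two agree

# ...versus marginal samples, obtained by just dropping the z component.
yz = joint.rvs(size=100_000, random_state=1)
y = yz[:, 0]
print(np.mean(y < y0), stats.norm.cdf(y0))  # empirical vs exact CDF at y0
```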
If we fix y to some value tilde{y} then the distribution pi on Y times Z defines a conditional distribution on Z. The conditional probability density function is given by partially evaluating the probability density function for pi, pi(z | tilde{y}) = constant * pi(tilde{y}, z).
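And the corresponding conditional density by partial evaluation for the same toy model, checked against the exact Gaussian conditional:

```python
import numpy as np
from scipy import integrate, stats

rho = 0.7
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

# Conditioning on y = 0.5 is partial evaluation, up to a constant...
y_tilde = 0.5
unnormalized = lambda z: joint.pdf([y_tilde, z])
constant, _ = integrate.quad(unnormalized, -np.inf, np.inf)

# ...and here the exact conditional is N(rho * y_tilde, 1 - rho^2).
z0 = 0.0
print(unnormalized(z0) / constant,
      stats.norm.pdf(z0, loc=rho * y_tilde, scale=np.sqrt(1 - rho**2)))
```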
Unless we happen to have a sample with f(y_n, z_n) = tilde{y} then our initial ensemble won't actually contain any samples from the conditional distribution. Again trying to generate enough samples from pi to get even a few conditional samples is a hopeless endeavor.
Everyone on the same page? Great. Now that we know the advantages and disadvantages of probability densities and samples we can consider when it's best to use them in practice.
In Bayesian inference the absolute first operation we need to implement is conditioning on observed data to turn a joint distribution over the observational and model configuration spaces into a posterior distribution over the model configuration space.
Once we have constructed our posterior distribution the remaining operations reduce to computing expectation values to characterize that posterior as well as any meaningful pushforward distributions.
The very definition of Bayesian inference requires us to condition first and then pushforward second. This motivates starting with a probability density function representation and then switching to a sampling representation after we've conditioned.
And hey this is exactly what Stan does! The Stan modeling language provides a way for you to define joint probability density functions pi(y, theta) over the observational space, y in Y, and the model configuration space, theta in Theta.
Once observed values of the data tilde{y} are available Stan can immediately condition this joint density function with partial evaluation, pi(theta | tilde{y}) = constant * pi(tilde{y}, theta), which gives the conditional posterior density function.
Finally Stan uses this conditional posterior density function to inform Hamiltonian Monte Carlo how to generate posterior samples, which can then be used to characterize all of the important bits of the posterior distribution.
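Here's a toy sketch of that condition-then-sample workflow in Python, with a crude random-walk Metropolis standing in for Stan's Hamiltonian Monte Carlo and a one-dimensional normal-normal model of my own choosing:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Joint model: theta ~ N(0, 1), y | theta ~ N(theta, 1).
def log_joint(y, theta):
    return stats.norm.logpdf(theta) + stats.norm.logpdf(y, loc=theta)

# Conditioning on observed data is partial evaluation of the log joint.
y_tilde = 1.2
log_posterior = lambda theta: log_joint(y_tilde, theta)

# A random-walk Metropolis stand-in for Hamiltonian Monte Carlo: only
# the unnormalized posterior density is needed to generate samples.
theta, draws = 0.0, []
for _ in range(50_000):
    proposal = theta + 0.5 * rng.normal()
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    draws.append(theta)

print(np.mean(draws))  # close to the exact posterior mean, y_tilde / 2
```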
For some problems -- namely when the data generating process is defined as a pushforward of some latent process -- constructing explicit joint density functions pi(y, theta) is a pain, but it's a pain that we have to face in order to enable conditioning.
Finally some reaches of the statistics literature will tell you that turning joint samples from pi(y, theta) into conditional samples from pi(theta | tilde{y}) isn't as impossible as I claimed above.
In particular what if you loosen the problem up a bit and accept samples with values that are only _close_ to tilde{y}? Maybe those samples will approximate conditional samples, and approximately characterize the conditional distribution, in some meaningful way?
This is the hope of "Approximate Bayesian Computation", but unfortunately, reader, that hope is as bad as the naming conventions for statistical computation methods (_all_ Bayesian computation is approximation, dammit).
Conceptually the acceptance rate of the method scales with the slack given to accepting samples, while the accuracy of the approximation scales inversely with that slack. The only way to generate samples at reasonable cost is to loosen the slack, which then degrades the accuracy.
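A sketch of that trade-off for the same toy normal-normal model, with the tolerances and the rejection scheme being my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)

# ABC rejection for the normal-normal model: accept joint samples
# whose simulated data lands within epsilon of the observed value.
y_tilde, n = 1.2, 1_000_000
theta = rng.normal(size=n)          # draws from the prior
y = theta + rng.normal(size=n)      # simulated data given each theta

for epsilon in (1.0, 0.1, 0.01):
    accepted = theta[np.abs(y - y_tilde) < epsilon]
    # Larger epsilon: many accepted draws but a cruder approximation;
    # smaller epsilon: better approximation but a vanishing acceptance rate.
    print(epsilon, len(accepted), np.mean(accepted))  # exact mean is 0.6
```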
There is plenty of theoretical work in this area but none of it really characterizes the accuracy of the approximation in any practically useful way. Hopefully this conceptual understanding clarifies just how fundamentally limited this approach is.
This is especially true in problems with more than a few dimensions, where samples tend to move further and further apart from any given point and even getting close to the fixed value becomes harder and harder.
Anyways probability theory is subtle mathematically and even more subtle in practice. In order to understand which methods have a chance of being helpful in our applied problems, and which are fundamentally doomed to fail, we need at least a conceptual understanding.
Otherwise we're at the mercy of people who happily promise the world because they don't have to deal with the consequences of those promises going unfulfilled.