In a previous post, I showed how to use Stan to calculate the Bayes marginal likelihood with the path sampling method.
For large models, this method can be computationally expensive, and therefore cheaper approximations of the marginal likelihood have been developed. The most famous one is the Bayesian Information Criterion (BIC), an estimate of the Bayes free energy (the negative log marginal likelihood).
Although the name suggests otherwise, the BIC is not a "true Bayesian" estimate, since it depends only on a point estimate and not on the entire posterior distribution:
\[
{\sf BIC} = -2 L(\hat{\theta}) + k \cdot \log(n)\,,
\]
where \(L\) is the log-likelihood function, \(\hat{\theta}\) is the maximum likelihood estimate of the model's parameters, \(k\) is the dimension of \(\theta\), and \(n\) is the number of data points.
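To make the formula concrete, here is a minimal sketch in plain Python (separate from the Stan workflow below) that computes the BIC of the biased-coin model used later in this post; its maximum likelihood estimate is simply \(\hat{p} = K/N\), and \(k = 1\):

import numpy as np

def bic(log_lik_hat, k, n):
    ## BIC = -2 * maximized log-likelihood + k * log(number of data points)
    return -2.0 * log_lik_hat + k * np.log(n)

## biased-coin model with N tosses and K heads; the MLE is p_hat = K/N
N, K = 50, 34  ## the values realized in the example below
p_hat = K / N
log_lik_hat = K * np.log(p_hat) + (N - K) * np.log(1.0 - p_hat)
print("BIC =", bic(log_lik_hat, k=1, n=N))  ## ~66.6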
More importantly, the BIC does not work for singular models. As Watanabe points out, if a statistical model contains hierarchical layers or hidden variables, then it is singular. Watanabe therefore proposes a generalization of the BIC, the widely applicable Bayesian information criterion (WBIC), that works for singular models as well and is truly Bayesian. This is not unlike the relation between the AIC and the WAIC.
Suppose that you have some sort of MCMC algorithm to sample from the posterior distribution of your model, given your data. When you want to compute the WAIC, you just have to store the log-likelihood of each observation, given each sampled set of parameters. This can be done during a regular run of the MCMC algorithm. For the WBIC, however, you will need to do a separate run at a different "temperature". In fact, Watanabe gives the following formula: \[ {\sf WBIC} = -2\mathbb{E}_{\theta}^{\beta}[L(\theta)]\,, \] where \(\beta = 1/\log(n)\), and the notation \(\mathbb{E}_{\theta}^{\beta}[f(\theta)]\) expresses that the expectation of \(f\) is taken with respect to the distribution with PDF \[ \theta \mapsto \frac{\exp(\beta L(\theta)) \pi(\theta)}{\int\exp(\beta L(\theta)) \pi(\theta)\,d\theta}\,, \] which equals the posterior distribution when \(\beta = 1\) and the prior distribution \(\pi\) when \(\beta = 0\).
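In practice, this expectation is estimated by a Monte Carlo average over draws \(\theta_1, \dots, \theta_S\) from the tempered posterior, which is exactly what the Stan implementation below does: \[ {\sf WBIC} \approx -\frac{2}{S}\sum_{s=1}^{S} L(\theta_s)\,, \qquad \theta_s \sim \frac{\exp(\beta L(\theta))\,\pi(\theta)}{\int\exp(\beta L(\theta))\,\pi(\theta)\,d\theta}\,, \qquad \beta = \frac{1}{\log(n)}\,. \] For the coin-toss example below, \(n = 50\) and hence \(\beta = 1/\log(50) \approx 0.26\).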
Something very similar happens in the path sampling method, except that the "path" is replaced by a single point. In a recent paper about another alternative to the BIC, the Singular BIC (sBIC), Drton and Plummer point out the similarity to the mean value theorem for integrals. Using the path sampling method, the log marginal likelihood is computed from the following integral: \[ \log(P(D|M)) = \int_0^1 \mathbb{E}_{\theta}^{\beta}[L(\theta)]\, d\beta\,. \] Here, we write \(P(D|M)\) for the marginal likelihood of model \(M\) and data \(D\). According to the mean value theorem, \[ \exists \beta^{\ast} \in (0,1): \log(P(D|M)) = \mathbb{E}_{\theta}^{\beta^{\ast}}[L(\theta)]\,, \] and Watanabe argues that choosing \(1/\log(n)\) for \(\beta^{\ast}\) gives a good approximation. Notice that the mean value theorem does not provide a recipe to compute \(\beta^{\ast}\), and that Watanabe uses rather advanced algebraic geometry to prove that his choice for \(\beta^{\ast}\) is good; the mean value theorem only gives a minimal justification for Watanabe's result.
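To see where this integral identity comes from, write \(Z(\beta) = \int \exp(\beta L(\theta))\,\pi(\theta)\,d\theta\) for the normalizing constant of the tempered posterior. Differentiating under the integral sign gives \[ \frac{d}{d\beta}\log(Z(\beta)) = \frac{\int L(\theta) \exp(\beta L(\theta))\,\pi(\theta)\,d\theta}{Z(\beta)} = \mathbb{E}_{\theta}^{\beta}[L(\theta)]\,, \] and since \(Z(0) = 1\) (the prior is normalized) and \(Z(1) = P(D|M)\), integrating this derivative over \(\beta\) from \(0\) to \(1\) yields exactly \(\log(P(D|M))\).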
Implementing WBIC in Stan
Let us start with a very simple example: the biased coin model. Suppose that we have \(N\) coin flips, with outcomes \(x \in \{0,1\}^N\) (heads = 1 and tails = 0), and let \(K\) denote the number of heads. We compare the null model (the coin is fair) with the alternative model (the coin is biased) using \(\Delta {\sf WBIC}\) and see if we get the "right" answer. Notice that for this simple example, we can compute the Bayes factor exactly.

In order to have a Stan model that can be used both for regular sampling and for estimating the WBIC, a sampling mode is passed with the data to determine the desired behavior. In the transformed data block, a parameter watanabe_beta is then defined, determining the sampling "temperature". The actual model is defined in the transformed parameters block, such that log_likes can be used in both the model block and the generated quantities block. Instead of a sampling statement (x[n] ~ bernoulli(p)), we have to use the bernoulli_lpmf function, which allows us to scale the log-likelihood with watanabe_beta (i.e. target += watanabe_beta * sum(log_likes)).
// coin_model.stan
data {
    int<lower=3> N;              // number of coin tosses
    int<lower=0, upper=1> x[N];  // outcomes (heads = 1, tails = 0)
    int<lower=0, upper=1> mode;  // 0 = normal sampling, 1 = WBIC sampling
}
transformed data {
    real<lower=0, upper=1> watanabe_beta; // WBIC parameter
    if ( mode == 0 ) {
        watanabe_beta = 1.0;
    }
    else { // mode == 1
        watanabe_beta = 1.0/log(N);
    }
}
parameters {
    real<lower=0, upper=1> p; // probability of heads
}
transformed parameters {
    vector[N] log_likes;
    for ( n in 1:N ) {
        log_likes[n] = bernoulli_lpmf(x[n] | p);
    }
}
model {
    p ~ beta(1, 1);
    target += watanabe_beta * sum(log_likes);
}
generated quantities {
    real log_like;
    log_like = sum(log_likes);
}

Using the pystan module for Python (3), one could estimate \(p\) from data \(x\) as follows:
import pystan
import numpy as np
import scipy.stats as sts

## compile the model
sm = pystan.StanModel(file="coin_model.stan")

## create random data
N = 50
p_real = 0.75
x = sts.bernoulli.rvs(p_real, size=N)

## prepare data for Stan
data = {'N' : N, 'x' : x, 'mode' : 0}
pars = ['p', 'log_like']

## run the Stan model
fit = sm.sampling(data=data)
chain = fit.extract(permuted=True, pars=pars)
ps = chain['p']
print("p =", np.mean(ps), "+/-", np.std(ps))

Which should give output similar to
p = 0.67187824104 +/- 0.0650602081466
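As a sanity check, with the uniform beta(1, 1) prior the posterior distribution is known exactly, namely \({\rm Beta}(K+1, N-K+1)\), so the MCMC estimate can be compared with the analytic mean and standard deviation (a small addition to the script above; the quoted numbers assume the simulated data contain \(K = 34\) heads, as in the exact computation at the end of this post):

## analytic posterior for the beta(1, 1) prior: Beta(K+1, N-K+1)
K = int(np.sum(x))
post = sts.beta(K + 1, N - K + 1)
print("analytic: p =", post.mean(), "+/-", post.std())  ## ~0.673 +/- 0.064 when K = 34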
In order to calculate the WBIC, the mode has to be set to 1:
## import modules, compile the Stan model, and create data as before...

## prepare data for Stan
data = {'N' : N, 'x' : x, 'mode' : 1} ## notice the mode
pars = ['p', 'log_like']

## run the Stan model
fit = sm.sampling(data=data)
chain = fit.extract(permuted=True, pars=pars)
WBIC = -2*np.mean(chain["log_like"])
print("WBIC =", WBIC)

which should result in something similar to
WBIC = 66.036854275

In order to test whether this result is any good, we first compute the WBIC of the null model. This is easy, since the null model does not have any free parameters: \[ {\sf WBIC} = -2\cdot L(x) = -2 \cdot \log\left(\tfrac12^{K}(1-\tfrac12)^{N-K}\right) = 2N\cdot\log(2)\,, \] and hence, when \(N=50\), we get \({\sf WBIC} \approx 69.3\). We therefore find positive evidence against the null model, since \(\Delta{\sf WBIC} \approx 3.3\).
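Continuing the Python session above, this comparison is a one-liner (WBIC here is the value computed in the previous snippet):

## WBIC of the parameter-free null model, and the evidence against the null model
WBIC_null = 2 * N * np.log(2)            ## ~69.3 for N = 50
print("delta WBIC =", WBIC_null - WBIC)  ## ~3.3 for the run above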
For the coin-toss model, we know the marginal likelihood explicitly, namely \[ P(x|M) = \int_0^1 p^K (1-p)^{N-K} dp = B(K+1, N-K+1)\,, \] where \(M\) denotes the alternative (i.e. biased) model, and \(B\) denotes the Beta function. In the above example, we had \(N = 50\) and \(K = 34\), and hence \(-2\cdot\log(P(x|M)) \approx 66.31\), which is remarkably close to the \({\sf WBIC}\) of the alternative model. Similar results hold for several values of \(p\), as demonstrated by the following figure.
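As a quick numerical check of this exact value (a minimal sketch continuing the Python session above; scipy.special.betaln returns \(\log B(a, b)\)):

from scipy.special import betaln

## exact -2 * log marginal likelihood of the biased-coin model
K = 34  ## number of heads in the example above
print("-2 log P(x|M) =", -2 * betaln(K + 1, N - K + 1))  ## ~66.31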