Tuesday 18 April 2017

How to calculate WBIC with Stan

In a previous post, I showed how to use Stan to calculate the Bayes marginal likelihood with the path sampling method. For large models, this method can be computationally expensive, and therefore cheaper approximations of the marginal likelihood have been developed. The most famous one is the Bayesian Information Criterion (BIC), an estimate of the Bayes free energy. Although the name suggests otherwise, the BIC is not a "true Bayesian" estimate, since it depends only on a point estimate, and not on the entire posterior distribution: \[ {\sf BIC} = -2 L(\hat{\theta}) + k \cdot \log(n)\,, \] where \(L\) is the log-likelihood function, \(\hat{\theta}\) is the maximum likelihood estimate of the model's parameters, \(k\) is the dimension of \(\theta\), and \(n\) is the number of data points.
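As a quick worked example (with made-up numbers): a model with a single parameter (\(k = 1\)), fit to \(n = 50\) observations with a maximum log-likelihood of \(-31.4\), would get
import numpy as np

log_lik_hat = -31.4  ## hypothetical maximum log-likelihood L(theta_hat)
k, n = 1, 50         ## number of parameters, number of data points
BIC = -2*log_lik_hat + k*np.log(n)
print("BIC =", BIC)  ## 62.8 + log(50) = 66.71...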
More importantly, the BIC does not work for singular models. As Watanabe points out, if a statistical model contains hierarchical layers or hidden variables then it is singular. Watanabe, therefore, proposes a generalization of the BIC, the WBIC, that works for singular models as well, and is truly Bayesian. This is not unlike AIC versus WAIC.

Suppose that you have some sort of MCMC algorithm to sample from the posterior distribution of your model, given your data. When you want to compute the WAIC, you just have to store the log-likelihood of each observation, given each sampled set of parameters. This can be done during an actual run of the MCMC algorithm. For the WBIC, however, you need to do a separate run at a different "temperature". In fact, Watanabe gives the following formula: \[ {\sf WBIC} = -2\mathbb{E}_{\theta}^{\beta}[L(\theta)]\,, \] where \(\beta = 1/\log(n)\), and the notation \(\mathbb{E}_{\theta}^{\beta}[f(\theta)]\) expresses that the expectation of \(f\) is taken with respect to the distribution with PDF \[ \theta \mapsto \frac{\exp(\beta L(\theta)) \pi(\theta)}{\int\exp(\beta L(\theta)) \pi(\theta)\, d\theta}\,, \] which equals the posterior distribution when \(\beta = 1\) and the prior distribution \(\pi\) when \(\beta = 0\).
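In code, the WBIC estimate is then just minus twice the average total log-likelihood over such a tempered sample. A minimal sketch, assuming log_likes is an \(S \times N\) array holding the log-likelihood of each of \(N\) observations for each of \(S\) parameter draws taken at inverse temperature \(\beta = 1/\log(N)\):
import numpy as np

def wbic(log_likes):
    ## log_likes: S x N array of pointwise log-likelihoods, where the
    ## parameter draws were sampled at inverse temperature beta = 1/log(N)
    return -2*np.mean(np.sum(log_likes, axis=1))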

Something very similar happens in the path sampling method, except that here the "path" of temperatures is replaced by a single point. In a recent paper about another alternative to the BIC, the Singular BIC (sBIC), Drton and Plummer point out the similarity to the mean value theorem for integrals. Using the path sampling method, the log marginal likelihood is computed with the following integral: \[ \log(P(D|M)) = \int_0^1 \mathbb{E}_{\theta}^{\beta}[L(\theta)]\, d\beta\,. \] Here, we write \(P(D|M)\) for the marginal likelihood of model \(M\) and data \(D\). According to the mean value theorem, \[ \exists \beta^{\ast} \in (0,1): \log(P(D|M)) = \mathbb{E}_{\theta}^{\beta^{\ast}}[L(\theta)]\,, \] and Watanabe argues that choosing \(1/\log(n)\) for \(\beta^{\ast}\) gives a good approximation. Notice that the mean value theorem does not provide a recipe to compute \(\beta^{\ast}\), and that Watanabe uses rather advanced algebraic geometry to prove that his choice of \(\beta^{\ast}\) is good; the mean value theorem only gives a minimal justification for Watanabe's result.
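For intuition: the coin-flip model treated below is simple enough to do all of this in closed form. With a uniform prior, the tempered distribution at inverse temperature \(\beta\) is a Beta\((\beta K + 1, \beta(N-K) + 1)\) distribution, so \(\mathbb{E}_{\theta}^{\beta}[L(\theta)]\) can be expressed with digamma functions. The following sketch (anticipating the values \(N = 50\), \(K = 34\) used below) compares the path sampling integral with Watanabe's single-point evaluation:
import numpy as np
from scipy.special import digamma

N, K = 50, 34 ## number of coin flips, number of heads

def expected_log_lik(beta):
    ## the tempered "posterior" of the coin model is Beta(beta*K + 1, beta*(N-K) + 1)
    a, b = beta*K + 1, beta*(N-K) + 1
    return K*(digamma(a) - digamma(a+b)) + (N-K)*(digamma(b) - digamma(a+b))

## path sampling: integrate E^beta[L] over a grid of temperatures
betas = np.linspace(0, 1, 1001)
print(np.trapz([expected_log_lik(b) for b in betas], betas)) ## ~ -33.16 = log P(D|M)

## WBIC: a single evaluation at beta* = 1/log(N)
print(-2*expected_log_lik(1/np.log(N))) ## ~ 66.2, close to -2*log P(D|M) ~ 66.31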

Implementing WBIC in Stan

Let us start with a very simple example: the biased coin model. Suppose that we have \(N\) coin flips, with outcomes \(x \in \{0,1\}^N\) (heads = 1 and tails = 0), and let \(K\) denote the number of heads. We compare the null model (the coin is fair) with the alternative model (the coin is biased) using \(\Delta {\sf WBIC}\) and see if we get the "right" answer. Notice that for this simple example, we can compute the Bayes factor exactly.

In order to have a Stan model that can be used both for regular sampling and for estimating the WBIC, a sampling mode is passed with the data to determine the desired behavior. In the transformed data block, a parameter watanabe_beta is then defined, determining the sampling "temperature".
The actual likelihood is defined in the transformed parameters block, such that log_likes can be used in both the model block and the generated quantities block. Instead of a sampling statement (x[n] ~ bernoulli(p)), we have to use the bernoulli_lpmf function, which allows us to scale the log-likelihood with watanabe_beta (i.e. target += watanabe_beta * sum(log_likes)).
// coin_model.stan

data {
    int<lower=3> N; // number of coin tosses
    int<lower=0, upper=1> x[N]; // outcomes (heads = 1, tails = 0)
    int<lower=0, upper=1> mode; // 0 = normal sampling, 1 = WBIC sampling
}
transformed data {
    real<lower=0, upper=1> watanabe_beta; // WBIC parameter
    if ( mode == 0 ) {
        watanabe_beta = 1.0;
    }
    else { // mode == 1
        watanabe_beta = 1.0/log(N);
    }
}
parameters {
    real<lower=0, upper=1> p; // probability of heads
}
transformed parameters {
    vector[N] log_likes;
    for ( n in 1:N ) {
        log_likes[n] = bernoulli_lpmf(x[n] | p);
    }
}
model {
    p ~ beta(1, 1);
    target += watanabe_beta * sum(log_likes);
}
generated quantities {
    real log_like;
    log_like = sum(log_likes);
}
Using the pystan module for Python 3, one can estimate \(p\) from the data \(x\) as follows:
import pystan
import numpy as np
import scipy.stats as sts

## compile the model
sm = pystan.StanModel(file="coin_model.stan")

## create random data
N = 50
p_real = 0.75
x = sts.bernoulli.rvs(p_real, size=N)

## prepare data for Stan
data = {'N' : N, 'x' : x, 'mode' : 0}
pars = ['p', 'log_like']

## run the Stan model
fit = sm.sampling(data=data)
chain = fit.extract(permuted=True, pars=pars)
ps = chain['p']
print("p =", np.mean(ps), "+/-", np.std(ps))
This should give output similar to
p = 0.67187824104 +/- 0.0650602081466
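As a sanity check: with the uniform prior, the posterior for \(p\) is a Beta\((K+1, N-K+1)\) distribution, so for the \(K = 34\) heads obtained in this run the posterior mean is \(35/52 \approx 0.673\), in line with the MCMC estimate above.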
In order to calculate the WBIC, the mode has to be set to 1:
## import modules, compile the Stan model, and create data as before...

## prepare data for Stan
data = {'N' : N, 'x' : x, 'mode' : 1} ## notice the mode
pars = ['p', 'log_like']

## run the Stan model
fit = sm.sampling(data=data)
chain = fit.extract(permuted=True, pars=pars)
WBIC = -2*np.mean(chain["log_like"])

print("WBIC =", WBIC)
which should result in something similar to
WBIC = 66.036854275
In order to test if this result is any good, we first compute the WBIC of the null model. This is easy, since the null model does not have any parameters: \[ {\sf WBIC} = -2\cdot L(x) = -2 \cdot \log\left(\tfrac12^{K}\left(1-\tfrac12\right)^{N-K}\right) = 2N\cdot\log(2)\,, \] and hence, when \(N=50\), we get \({\sf WBIC} \approx 69.3\). We therefore find positive evidence against the null model, since \(\Delta{\sf WBIC} \approx 3.3\).
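In code, the null model's WBIC is a one-liner (reusing N from before):
WBIC_null = 2*N*np.log(2)
print("WBIC (null) =", WBIC_null) ## 69.31...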

For the coin-toss model, we know the marginal likelihood explicitly, namely \[ P(x|M) = \int_0^1 p^K (1-p)^{N-K}\, dp = B(K+1, N-K+1)\,, \] where \(M\) denotes the alternative (i.e. biased) model, and \(B\) denotes the Beta function. In the above example, we had \(N = 50\) and \(K = 34\), and hence \(-2\cdot\log(P(x|M)) \approx 66.31\), which is remarkably close to the \({\sf WBIC}\) of the alternative model. Similar results hold for several values of \(p\), as demonstrated by the following figure.
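This exact value is easily checked with scipy, using the simulated data x from above:
from scipy.special import betaln

K = np.sum(x) ## number of heads; K = 34 in the run above
print("-2 log P(x|M) =", -2*betaln(K + 1, N - K + 1)) ## ~ 66.31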

A non-trivial example

As the example above is non-singular, while the WBIC is supposed to work for singular models as well, I plan to present an example with a singular model here.
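To give an idea of what such an example could look like, here is a minimal sketch (not yet a worked example) of the same mode/watanabe_beta construction for a classic singular model: a mixture of two unit-variance Gaussians. Only the data and the likelihood change; the tempering logic is identical to the coin model.
// mixture_model.stan (sketch)

data {
    int<lower=3> N; // number of observations
    real x[N]; // observations
    int<lower=0, upper=1> mode; // 0 = normal sampling, 1 = WBIC sampling
}
transformed data {
    real<lower=0, upper=1> watanabe_beta; // WBIC parameter
    if ( mode == 0 ) {
        watanabe_beta = 1.0;
    }
    else { // mode == 1
        watanabe_beta = 1.0/log(N);
    }
}
parameters {
    real<lower=0, upper=1> w; // weight of the first mixture component
    ordered[2] mu; // component means (ordered to avoid label switching)
}
transformed parameters {
    vector[N] log_likes;
    for ( n in 1:N ) {
        log_likes[n] = log_mix(w, normal_lpdf(x[n] | mu[1], 1.0),
                                  normal_lpdf(x[n] | mu[2], 1.0));
    }
}
model {
    mu ~ normal(0, 10);
    target += watanabe_beta * sum(log_likes);
}
generated quantities {
    real log_like;
    log_like = sum(log_likes);
}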