Tuesday 22 November 2016

Using the "ones trick" to handle unbalanced missing data with JAGS

The so-called "ones trick" for JAGS and BUGS models allows the user to sample from distributions that are not in the standard list. Here I show another application for "unbalanced" missing data. More examples using the ones (or zeros) trick can be found in The BUGS Book, and in particular in Chapter 9.

Suppose that we want to find an association between a trait \(x\) (possibly continuous) and another categorical trait \(k \in \{1, 2,\dots, K\}\). We have \(N\) observations \(x_i\), however, some of the \(k_i\) are missing. Whenever there is no information on the missing \(k_i\) whatsoever, we can simply write the following JAGS model:
data {
   /* alpha is the parameter for the Dirichlet prior of p, 
    * with p the estimated frequency of k
    */
   for ( j in 1:K ) {
      alpha[j] <- 1
   }
}
model {
   for ( i in 1:N ) {
      /* Some model for x, given k */
      x[i] ~ dnorm(xhat[k[i]], tau_x)
      /* Sample missing values of k, and inform
       * the distribution p with the known k's
       */
      k[i] ~ dcat(p)
   }
   /* Model the distribution p_k of the trait k */
   p ~ ddirch(alpha)

   /* The precision of the likelihood for x */
   tau_x ~ dgamma(0.01, 0.01)   

   /* Give the xhat's a shared prior to regularize them */
   for ( j in 1:K ) {
      xhat[j] ~ dnorm(mu_xhat, tau_xhat)
   }
   mu_xhat ~ dnorm(0, 0.01)
   tau_xhat ~ dgamma(0.01, 0.01)
}
The data file must contain the vector \(k\), with NA in the place of the missing values.

Suppose now that the missing \(k_i\)s are not completely unknown. In stead, suppose that we know that some of the \(k_i\) must belong to a group \(g_i\). The group \(g_i\) is encoded as a binary vector, where a \(g_{i,j} = 1\) indicates that the trait value \(j\) is in the group, and \(0\) that it is not. In particular, when \(k_i\) is known, then \[ g_{i,j} = \left\{ \begin{array}{ll} 1 & {\rm if\ } j = k_i \\ 0 & {\rm otherwise} \end{array} \right. \] If \(k_i\) is completely unknown, then we just let \(g_{i,j} = 1\) for each \(j\).
data {
   for ( j in 1:K ) {
      alpha[j] <- 1
   }
   /* for the "ones trick", 
    * we need a 1 for every observation 
    */
   for ( i in 1:N ) {
      ones[i] <- 1
   }
}
model {
   for ( i in 1:N ) {
      x[i] ~ dnorm(xhat[k[i]], tau_x)
      /* sample missing k's using the group-vector g */
      k[i] ~ dcat(g[i,])
      /* in order to inform p correctly,
       * we need to multiply the posterior with p[k[i]],
       * which can be done by observing a 1, 
       * assumed to be Bernoulli(p[k[i]]) distributed
       */
      ones[i] ~ dbern(p[k[i]])
   }
   p ~ ddirch(alpha)

   tau_x ~ dgamma(0.01, 0.01)   
   for ( j in 1:K ) {
      xhat[j] ~ dnorm(mu_xhat, tau_xhat)
   }
   mu_xhat ~ dnorm(0, 0.01)
   tau_xhat ~ dgamma(0.01, 0.01)
}
In stead of using the "ones trick", it would have been more clear-cut if we were able to write
k[i] ~ dcat(g[i,]) /* sampling statement */
k[i] ~ dcat(p) /* statement to inform p */
but in this way JAGS can not tell that it is to sample from \({\rm Categorical}(g_i)\), and not from \({\rm Categorical}(p)\).
Similarly, it might have been tempting to write
k[i] ~ dcat(g[i,] * p) /* point-wise vector multiplication */
This would correctly prefer the common \(k\,\)s in group \(g_i\) over the rare \(k\,\)s, but the fact that \(k_i\) must come from the group \(g_i\) is not used to inform \(p\).

As an example, I generated some random \(k_i\)s sampled from \({\rm Categorical}(p)\) (I did not bother to sample any \(x_i\)s). I have taken \(K = 15\), \(N = 2000\), and \(3\) randomly chosen groups. For a \(1000\) observations, I "forget" the actual \(k_i\), and only know the group \(g_i\). The goal is to recover \(p\) and the missing \(k_i\)s.
Using a chain of length \(20000\) (using the first \(10000\) as a burn-in and \(1/10\)-th thinning) we get the following result:


Thursday 5 May 2016

Computing Bayes factors with Stan and a path-sampling method

Stan is a great program for MCMC (or HMC, really). Vehtari et al. explain here how to use Stan to compute WAIC. For the Bayes factor, however, I have not found a method yet, and therefore I would like to demonstrate a possible method here. This will obviously not work well for every model; this is merely an experiment.
Recently, I was really intrigued by a paper by Gelman and Meng, where several methods for computing Bayes factors, or normalizing constants, are explained and connected (even the really bad ones). Here, I will use the path sampling method.
Let us implement a simple model in Stan, for which we can explicitly compute the marginal likelihood. Then we can try to estimate this marginal likelihood with the path-sampling method, and compare it with the exact value.
A very simple model is the 'fair coin' example (taken directly from wikipedia). The Bayes factor between a null-model \(M_0\) and a model that incorporates a bias, \(M_1\), can be computed directly as the quotient of the marginal likelihoods, and moreover, the null model does not have any parameters.
Let \(n\) denote the number of coin tosses, and \(k\) the number of 'heads'. Hence, the data \(D\) is given by the pair \((n,k)\). Given a prior \(\theta \sim {\rm Beta}(\alpha, \beta)\) on the probability of throwing heads, we get the posterior \(\theta \sim {\rm Beta}(\alpha + k, \beta +n-k)\), and we can compute the marginal likelihood exactly:
\[ p(D|M_1) = \int_0^1 p(D|\theta)\pi(\theta) d\theta \] \[ = {n \choose k}\int_0^1 \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1} d\theta = {n \choose k} B(k+\alpha, n-k+\beta)\,, \]
where \(B\) denotes the Beta function. Meanwhile,
\[ p(D | M_0) = {n \choose k} \left(\tfrac12\right)^k\left(1-\tfrac12\right)^{n-k}\,. \]
In this instance of path sampling (and closely following Gelman and Meng), we consider a family of (un-normalized) distributions \(Q_T\), indexed by a parameter \(T \in [0,1]\), such that \(Q_0(\theta) = \pi(\theta)\) and \(Q_1(\theta) = p(D|\theta)\pi(\theta)\). The normalizing constants are denoted by \(z(T)\). Notice that \(z(0) = 1\) and \(z(1) = p(D|M_1)\).
Let \(\Theta = [0,1]\) denote the support of \(\theta\). Since
\[ \frac{d}{dT} \log z(T) = \frac{1}{z(T)} \frac{d}{dT} z(T) = \frac{1}{z(T)} \frac{d}{dT} \int_{\Theta} Q_T(\theta) d\theta\,, \]
we get that
\[ \frac{d}{dT} \log z(T) = \int_{\Theta} \frac{1}{z(T)} \frac{d}{dT} Q_T(\theta) d\theta\,, \]
and hence
\[ \frac{d}{dT} \log z(T) = \int_{\Theta} \frac{Q_T(\theta)}{z(T)} \frac{d}{dT} \log(Q_T(\theta)) d\theta\,. \]
When we denote by \(\mathbb{E}_T\) the expectation under \(P_T\), we get that
\[ \frac{d}{dT} \log z(T) = \int_{\Theta} P_T(\theta) \frac{d}{dT} \log(Q_T(\theta)) d\theta = \mathbb{E}_T\left[ \frac{d}{dT} \log(Q_T(\theta)) \right]\,. \]
We can think of \(U(\theta, T) := \frac{d}{dT} \log(Q_T(\theta))\) as 'potential energy', and we get
\[ \int_0^1 \mathbb{E}_T\left[U(\theta, T)\right] dT = \int_0^1 \frac{d}{dT} \log(z(T)) dT \] \[ = \log(z(1)) - \log(z(0)) = \log(z(1)/z(0)) =: \lambda\,. \]
Notice that in our case \(\lambda = \log(P(D|M_1))\). We can interpret \(\int_0^1 \mathbb{E}_T\left[U(\theta, T)\right] dT\) as the expectation of \(U\) over the joint probability density of \(T\) (with a uniform prior) and \(\theta\):
\[ \lambda = \mathbb{E}\left[ U(\theta, T)\right]\,. \]
This suggests an estimator \(\hat{\lambda}\) for \(\lambda\):
\[ \hat{\lambda} = \frac{1}{N} \sum_{i=1}^N U(\theta_i, T_i)\,, \]
where \((\theta_i, T_i)_i\) is a sample from the joint distribution of \(\theta\) and \(T\). A way of creating such a sample, is first sampling \(T_i\) from its marginal (uniform) distribution, and then sampling \(\theta_i\) from \(P_{T_i}\). This last step might require some Monte Carlo sampling.
First, we need to choose \(1\)-parameter family of distributions. A simple choice is the geometric path:
\[ Q_T(\theta) = \pi(\theta)^{1-T} (\pi(\theta)p(D|\theta))^{T} = \pi(\theta) p(D|\theta)^T\,. \]
In this case, the potential energy simply equals \(\frac{d}{dT}\log(Q_T(\theta)) = \log p(D|\theta)\)

The Stan model

Using the pystan interface, we can implement the model as follows. The most important parts are the parameter \(T\) (declared in the data section), and the "generated quantity" \(U\).
## import some modules
import pystan
import scipy.stats as sts
import scipy.special as spl
import numpy as np
import multiprocessing

## define a Stan model
model = """
data {
    int<lower=0> n;
    int<lower=0, upper=n> k;
    real<lower=0> alpha;
    real<lower=0> beta;
    real<lower=0, upper=1> T; // parameter for path sampling
}
parameters {
    real<lower=0, upper=1> theta;
}
model {
    theta ~ beta(alpha, beta);
    increment_log_prob(T*binomial_log(k, n, theta));
    // replaces sampling statement "k ~ binomial(n, theta)"
}
generated quantities {
    real U;
    U <- binomial_log(k, n, theta);
}
"""

## let Stan translate this into C++, and compile...
sm = pystan.StanModel(model_code=model)

A parallel method

We need to generate samples \(T_i\) from \({\rm Uniform}(0,1)\), and then, given \(T_i\) we generate a sample \(\theta_i\) from \(P_{T_i}\). The simplest way is just to make a partition \(T_i = \tfrac{i}{N}\) of \([0,1]\) and then for each \(i=0,\dots,N\), use the Stan model with \(T = T_i\). Notice that for each \(i\), we will generate multiple (\(K\), say) samples from \(P_{T_i}\). This method lends itself well for multi-processing, as all \(N+1\) Stan sessions can run in parallel.
## choose some parameters
k = 10 ## heads
n = 100 ## coin tosses
alp = 1 ## determines prior for q
bet = 1 ## determines prior for q
K = 100 ## length of each chain
N = 1000 ## number of Ts

## a function that prepares a data dictionary,
## and then runs the Stan model
def runStanModel(T):
    coin_data = {
        'n' : n, 
        'k' : k, 
        'alpha' : alp,
        'beta' : bet,
        'T' : T
    }
    fit = sm.sampling(data=coin_data, iter=2*K, 
                      warmup=K, chains=1) 
    la = fit.extract(permuted=True)
    return la['U'] ## U is a "generated quantity"

## make a partition of [0,1] 
Ts = np.linspace(0, 1, N+1)
## start a worker pool
pool = multiprocessing.Pool(4) ## 4 threads
## for each T in Ts, run the Stan model
Us = np.array(pool.map(runStanModel, Ts))
Let's have a look at the result. Notice that for \(\alpha=\beta=1\), the marginal likelihood does not depend on \(k\) as \[ P(D|M_1) = {n \choose k} \frac{\Gamma(k+1)\Gamma(n-k+1)}{\Gamma(n+2)} = {n \choose k} \frac{k! (n-k)!}{(n+1)!} = \frac{1}{n+1} \] We could take for \(\hat{\lambda}\) the average of all \((N+1)\cdot K\) samples, but in my experience, the standard error is more realistic when I only take one sample per \(T_i\).
## take one sample for each T
lamhat = np.mean(Us[:,-1]) 
## we can also compute a standard error!!
se_lamhat = sts.sem(Us[:,-1])

print "extimated lambda = %f +/- %f"%(lamhat, se_lamhat)
print "estimated p(D|M_1) = %f"%np.exp(lamhat)

exactMargLike = spl.beta(k+alp, n-k+bet) * spl.binom(n,k)
exactMargLoglike = np.log(exactMargLike)

print "exact lambda = %f"%exactMargLoglike
print "exact p(D|M_1) = %f"%exactMargLike
In my case, the result is
estimated lambda = -4.724850 +/- 0.340359
estimated p(D|M_1) = 0.008872
exact lambda = -4.615121
exact p(D|M_1) = 0.009901

A serial method

Another method does not use parallel processing, but uses the fact that the distributions \(P_{T_i}\) and \(P_{T_{i+1}}\) are very similar when \(T_{i}-T_{i+1} = \frac{1}{N}\) is small. When we have a sample from \(P_{T_i}\), we can use it as the initial condition for the Stan run with \(T = T_{i+1}\). We then only need very little burn-in (warm-up) time before we are actually sampling from \(P_{T_{i+1}}\). We can specify the number of independent chains that Stan computes, and also separate initial parameters for each of the chains. Hence, we can take multiple samples \(P_{T_i}\) as initial choices for the next chain. For this very simple model, this "serial" method is much slower than the parallel method, but my guess is that it could be a lot faster for more complicated models. I hope to prove this claim in a future post.
## choose some parameters
k = 10 ## heads
n = 100 ## coin tosses
alp = 1 ## determines prior for q
bet = 1 ## determines prior for q
K = 100 ## size initial chain
N = 200 ## number of Ts

## initially, do a longer run with T=0
coin_data = {
    'n' : n, 
    'k' : k, 
    'alpha' : alp,
    'beta' : bet,
    'T' : 0
}
fit = sm.sampling(data=coin_data, iter=2*K, 
                  warmup=K, chains=1)
la = fit.extract(permuted=True)

## in stead of length K, 
## now use a much shorter chain (of length L)
L = 10 
chains = 4 
Us = np.zeros(shape=(N+1,L*chains))
Ts = np.linspace(0, 1, N+1)

## now run the 'chain of chains'
for i, Ti in enumerate(Ts):
    coin_data['T'] = Ti ## take another T
    ## take some thetas from the previous sample
    thetas = np.random.choice(la["theta"], chains)
    initial_guesses = [{'theta' : theta} for theta in thetas]
    fit = sm.sampling(data=coin_data, iter=2*L, warmup=L, 
                      chains=chains, init=initial_guesses)
    la = fit.extract(permuted=True)
    Us[i,:] = la['U']
Ok, let us have another look at the result. For this, I used the same code as above:
estimated lambda = -5.277354 +/- 1.120274
estimated p(D|M_1) = 0.005106
exact lambda = -4.615121
exact p(D|M_1) = 0.009901
As you can see, the estimate is less precise, but this is due to the fact that \(N=200\) instead of \(1000\).
I've written in, and I ran the above code fragments from a Jupyter notebook. As compiling and sampling can take a lot of time, such an interface can be very convenient. Please let me know when this was somehow useful for you or if you have any questions, and also please tell me if I did something stupid...

Monday 2 May 2016

A Sunset Colormap for Python

Room Z533 in the Kruyt building in Utrecht can have a wonderful view, especially at sunset. Not too long ago, I took the following photograph.

Having a love-hate relationship with colormaps, and in particular choosing a good one, I INSTANTLY noticed the beauty of the gradient of the sky, and hence, its application as a colormap.

In Python, it is easy to import figures as a matrix, and then extract the RGB values along a line. To make things easier, I first cropped out the part of the sky I wanted:
Then, I had a look at this website.
Let's import the "slice of sky" in Python and have a look:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LinearSegmentedColormap
from PIL import Image

im = Image.open("/path/to/sunset.jpg")
pix = im.load()

xs = range(im.size[1])
y = im.size[0]/2
rs = [pix[y,x][0] for x in xs]
gs = [pix[y,x][1] for x in xs]
bs = [pix[y,x][2] for x in xs]

fig = plt.figure(figsize=(5,2))
ax = fig.add_subplot(111)

ax.plot(xs, rs, color='red')
ax.plot(xs, gs, color='green')
ax.plot(xs, bs, color='blue')
ax.set_xlabel("pixel")
ax.set_ylabel('r/g/b value')

plt.savefig('rgb-plot.png', dpi=300, bboxinches='tight')

This results in the following figure:

Now let's make the actual colormap.
Dp = 1 ## determines the number of pixels between "nodes"
xs = np.linspace(0,1,im.size[1]/Dp + 1) ## points between 0 and 1 
idxs = range(0,im.size[1],Dp) ## indices in the original picture matrix

redlist = [rs[idx]/255. for idx in idxs] + [rs[-1]/255., 0]
greenlist = [gs[idx]/255. for idx in idxs] + [gs[-1]/255., 0]
bluelist = [bs[idx]/255. for idx in idxs] + [bs[-1]/255., 0]

## LinearSegmentedColormap wants these weird triples, 
## where some end points are ignored...

redtuples = [(x, redlist[i], redlist[i+1]) for i, x in enumerate(xs)]
greentuples = [(x, greenlist[i], greenlist[i+1]) for i, x in enumerate(xs)]
bluetuples = [(x, bluelist[i], bluelist[i+1]) for i, x in enumerate(xs)]

cdict = {'red' : redtuples, 'green' : greentuples, 'blue' : bluetuples}

cmap = LinearSegmentedColormap('tbz533', cdict) ## choose a name
plt.register_cmap(cmap=cmap)
## now we can use our new colormap!

OK. Let's make some 70s wallpaper to celebrate the new colormap:
xs = np.linspace(-5,5,1000)
ys = np.linspace(-5,5,1000)
zs = np.array([[np.sin(x)*np.cos(y) for x in xs] for y in ys])

fig = plt.figure(figsize=(5.5,5))
ax1 = fig.add_subplot(1,1,1)
C = ax1.pcolormesh(xs, ys, zs, cmap='tbz533')
fig.colorbar(C)

fig.savefig('my70swallpaper.png', dpi=200, bboxinches='tight')

This results in the following figure:

That's it! Please leave your own favorite colormap in the comments.