Processing math: 42%

Tuesday, 22 November 2016

Using the "ones trick" to handle unbalanced missing data with JAGS

The so-called "ones trick" for JAGS and BUGS models allows the user to sample from distributions that are not in the standard list. Here I show another application for "unbalanced" missing data. More examples using the ones (or zeros) trick can be found in The BUGS Book, and in particular in Chapter 9.

Suppose that we want to find an association between a trait x (possibly continuous) and another categorical trait k{1,2,,K}. We have N observations xi, however, some of the ki are missing. Whenever there is no information on the missing ki whatsoever, we can simply write the following JAGS model:
data {
   /* alpha is the parameter for the Dirichlet prior of p, 
    * with p the estimated frequency of k
    */
   for ( j in 1:K ) {
      alpha[j] <- 1
   }
}
model {
   for ( i in 1:N ) {
      /* Some model for x, given k */
      x[i] ~ dnorm(xhat[k[i]], tau_x)
      /* Sample missing values of k, and inform
       * the distribution p with the known k's
       */
      k[i] ~ dcat(p)
   }
   /* Model the distribution p_k of the trait k */
   p ~ ddirch(alpha)

   /* The precision of the likelihood for x */
   tau_x ~ dgamma(0.01, 0.01)   

   /* Give the xhat's a shared prior to regularize them */
   for ( j in 1:K ) {
      xhat[j] ~ dnorm(mu_xhat, tau_xhat)
   }
   mu_xhat ~ dnorm(0, 0.01)
   tau_xhat ~ dgamma(0.01, 0.01)
}
The data file must contain the vector k, with NA in the place of the missing values.

Suppose now that the missing kis are not completely unknown. In stead, suppose that we know that some of the ki must belong to a group gi. The group gi is encoded as a binary vector, where a gi,j=1 indicates that the trait value j is in the group, and 0 that it is not. In particular, when ki is known, then gi,j={1if j=ki0otherwise If ki is completely unknown, then we just let gi,j=1 for each j.
data {
   for ( j in 1:K ) {
      alpha[j] <- 1
   }
   /* for the "ones trick", 
    * we need a 1 for every observation 
    */
   for ( i in 1:N ) {
      ones[i] <- 1
   }
}
model {
   for ( i in 1:N ) {
      x[i] ~ dnorm(xhat[k[i]], tau_x)
      /* sample missing k's using the group-vector g */
      k[i] ~ dcat(g[i,])
      /* in order to inform p correctly,
       * we need to multiply the posterior with p[k[i]],
       * which can be done by observing a 1, 
       * assumed to be Bernoulli(p[k[i]]) distributed
       */
      ones[i] ~ dbern(p[k[i]])
   }
   p ~ ddirch(alpha)

   tau_x ~ dgamma(0.01, 0.01)   
   for ( j in 1:K ) {
      xhat[j] ~ dnorm(mu_xhat, tau_xhat)
   }
   mu_xhat ~ dnorm(0, 0.01)
   tau_xhat ~ dgamma(0.01, 0.01)
}
In stead of using the "ones trick", it would have been more clear-cut if we were able to write
k[i] ~ dcat(g[i,]) /* sampling statement */
k[i] ~ dcat(p) /* statement to inform p */
but in this way JAGS can not tell that it is to sample from Categorical(gi), and not from Categorical(p).
Similarly, it might have been tempting to write
k[i] ~ dcat(g[i,] * p) /* point-wise vector multiplication */
This would correctly prefer the common ks in group gi over the rare ks, but the fact that ki must come from the group gi is not used to inform p.

As an example, I generated some random kis sampled from Categorical(p) (I did not bother to sample any xis). I have taken K=15, N=2000, and 3 randomly chosen groups. For a 1000 observations, I "forget" the actual ki, and only know the group gi. The goal is to recover p and the missing kis.
Using a chain of length 20000 (using the first 10000 as a burn-in and 1/10-th thinning) we get the following result:


Thursday, 5 May 2016

Computing Bayes factors with Stan and a path-sampling method

Stan is a great program for MCMC (or HMC, really). Vehtari et al. explain here how to use Stan to compute WAIC. For the Bayes factor, however, I have not found a method yet, and therefore I would like to demonstrate a possible method here. This will obviously not work well for every model; this is merely an experiment.
Recently, I was really intrigued by a paper by Gelman and Meng, where several methods for computing Bayes factors, or normalizing constants, are explained and connected (even the really bad ones). Here, I will use the path sampling method.
Let us implement a simple model in Stan, for which we can explicitly compute the marginal likelihood. Then we can try to estimate this marginal likelihood with the path-sampling method, and compare it with the exact value.
A very simple model is the 'fair coin' example (taken directly from wikipedia). The Bayes factor between a null-model M0 and a model that incorporates a bias, M1, can be computed directly as the quotient of the marginal likelihoods, and moreover, the null model does not have any parameters.
Let n denote the number of coin tosses, and k the number of 'heads'. Hence, the data D is given by the pair (n,k). Given a prior θBeta(α,β) on the probability of throwing heads, we get the posterior θBeta(α+k,β+nk), and we can compute the marginal likelihood exactly:
p(D|M1)=10p(D|θ)π(θ)dθ = {n \choose k}\int_0^1 \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1} d\theta = {n \choose k} B(k+\alpha, n-k+\beta)\,,
where B denotes the Beta function. Meanwhile,
p(D | M_0) = {n \choose k} \left(\tfrac12\right)^k\left(1-\tfrac12\right)^{n-k}\,.
In this instance of path sampling (and closely following Gelman and Meng), we consider a family of (un-normalized) distributions Q_T, indexed by a parameter T \in [0,1], such that Q_0(\theta) = \pi(\theta) and Q_1(\theta) = p(D|\theta)\pi(\theta). The normalizing constants are denoted by z(T). Notice that z(0) = 1 and z(1) = p(D|M_1).
Let \Theta = [0,1] denote the support of \theta. Since
\frac{d}{dT} \log z(T) = \frac{1}{z(T)} \frac{d}{dT} z(T) = \frac{1}{z(T)} \frac{d}{dT} \int_{\Theta} Q_T(\theta) d\theta\,,
we get that
\frac{d}{dT} \log z(T) = \int_{\Theta} \frac{1}{z(T)} \frac{d}{dT} Q_T(\theta) d\theta\,,
and hence
\frac{d}{dT} \log z(T) = \int_{\Theta} \frac{Q_T(\theta)}{z(T)} \frac{d}{dT} \log(Q_T(\theta)) d\theta\,.
When we denote by \mathbb{E}_T the expectation under P_T, we get that
\frac{d}{dT} \log z(T) = \int_{\Theta} P_T(\theta) \frac{d}{dT} \log(Q_T(\theta)) d\theta = \mathbb{E}_T\left[ \frac{d}{dT} \log(Q_T(\theta)) \right]\,.
We can think of U(\theta, T) := \frac{d}{dT} \log(Q_T(\theta)) as 'potential energy', and we get
\int_0^1 \mathbb{E}_T\left[U(\theta, T)\right] dT = \int_0^1 \frac{d}{dT} \log(z(T)) dT = \log(z(1)) - \log(z(0)) = \log(z(1)/z(0)) =: \lambda\,.
Notice that in our case \lambda = \log(P(D|M_1)). We can interpret \int_0^1 \mathbb{E}_T\left[U(\theta, T)\right] dT as the expectation of U over the joint probability density of T (with a uniform prior) and \theta:
\lambda = \mathbb{E}\left[ U(\theta, T)\right]\,.
This suggests an estimator \hat{\lambda} for \lambda:
\hat{\lambda} = \frac{1}{N} \sum_{i=1}^N U(\theta_i, T_i)\,,
where (\theta_i, T_i)_i is a sample from the joint distribution of \theta and T. A way of creating such a sample, is first sampling T_i from its marginal (uniform) distribution, and then sampling \theta_i from P_{T_i}. This last step might require some Monte Carlo sampling.
First, we need to choose 1-parameter family of distributions. A simple choice is the geometric path:
Q_T(\theta) = \pi(\theta)^{1-T} (\pi(\theta)p(D|\theta))^{T} = \pi(\theta) p(D|\theta)^T\,.
In this case, the potential energy simply equals \frac{d}{dT}\log(Q_T(\theta)) = \log p(D|\theta)

The Stan model

Using the pystan interface, we can implement the model as follows. The most important parts are the parameter T (declared in the data section), and the "generated quantity" U.
## import some modules
import pystan
import scipy.stats as sts
import scipy.special as spl
import numpy as np
import multiprocessing

## define a Stan model
model = """
data {
    int<lower=0> n;
    int<lower=0, upper=n> k;
    real<lower=0> alpha;
    real<lower=0> beta;
    real<lower=0, upper=1> T; // parameter for path sampling
}
parameters {
    real<lower=0, upper=1> theta;
}
model {
    theta ~ beta(alpha, beta);
    increment_log_prob(T*binomial_log(k, n, theta));
    // replaces sampling statement "k ~ binomial(n, theta)"
}
generated quantities {
    real U;
    U <- binomial_log(k, n, theta);
}
"""

## let Stan translate this into C++, and compile...
sm = pystan.StanModel(model_code=model)

A parallel method

We need to generate samples T_i from {\rm Uniform}(0,1), and then, given T_i we generate a sample \theta_i from P_{T_i}. The simplest way is just to make a partition T_i = \tfrac{i}{N} of [0,1] and then for each i=0,\dots,N, use the Stan model with T = T_i. Notice that for each i, we will generate multiple (K, say) samples from P_{T_i}. This method lends itself well for multi-processing, as all N+1 Stan sessions can run in parallel.
## choose some parameters
k = 10 ## heads
n = 100 ## coin tosses
alp = 1 ## determines prior for q
bet = 1 ## determines prior for q
K = 100 ## length of each chain
N = 1000 ## number of Ts

## a function that prepares a data dictionary,
## and then runs the Stan model
def runStanModel(T):
    coin_data = {
        'n' : n, 
        'k' : k, 
        'alpha' : alp,
        'beta' : bet,
        'T' : T
    }
    fit = sm.sampling(data=coin_data, iter=2*K, 
                      warmup=K, chains=1) 
    la = fit.extract(permuted=True)
    return la['U'] ## U is a "generated quantity"

## make a partition of [0,1] 
Ts = np.linspace(0, 1, N+1)
## start a worker pool
pool = multiprocessing.Pool(4) ## 4 threads
## for each T in Ts, run the Stan model
Us = np.array(pool.map(runStanModel, Ts))
Let's have a look at the result. Notice that for \alpha=\beta=1, the marginal likelihood does not depend on k as P(D|M_1) = {n \choose k} \frac{\Gamma(k+1)\Gamma(n-k+1)}{\Gamma(n+2)} = {n \choose k} \frac{k! (n-k)!}{(n+1)!} = \frac{1}{n+1} We could take for \hat{\lambda} the average of all (N+1)\cdot K samples, but in my experience, the standard error is more realistic when I only take one sample per T_i.
## take one sample for each T
lamhat = np.mean(Us[:,-1]) 
## we can also compute a standard error!!
se_lamhat = sts.sem(Us[:,-1])

print "extimated lambda = %f +/- %f"%(lamhat, se_lamhat)
print "estimated p(D|M_1) = %f"%np.exp(lamhat)

exactMargLike = spl.beta(k+alp, n-k+bet) * spl.binom(n,k)
exactMargLoglike = np.log(exactMargLike)

print "exact lambda = %f"%exactMargLoglike
print "exact p(D|M_1) = %f"%exactMargLike
In my case, the result is
estimated lambda = -4.724850 +/- 0.340359
estimated p(D|M_1) = 0.008872
exact lambda = -4.615121
exact p(D|M_1) = 0.009901

A serial method

Another method does not use parallel processing, but uses the fact that the distributions P_{T_i} and P_{T_{i+1}} are very similar when T_{i}-T_{i+1} = \frac{1}{N} is small. When we have a sample from P_{T_i}, we can use it as the initial condition for the Stan run with T = T_{i+1}. We then only need very little burn-in (warm-up) time before we are actually sampling from P_{T_{i+1}}. We can specify the number of independent chains that Stan computes, and also separate initial parameters for each of the chains. Hence, we can take multiple samples P_{T_i} as initial choices for the next chain. For this very simple model, this "serial" method is much slower than the parallel method, but my guess is that it could be a lot faster for more complicated models. I hope to prove this claim in a future post.
## choose some parameters
k = 10 ## heads
n = 100 ## coin tosses
alp = 1 ## determines prior for q
bet = 1 ## determines prior for q
K = 100 ## size initial chain
N = 200 ## number of Ts

## initially, do a longer run with T=0
coin_data = {
    'n' : n, 
    'k' : k, 
    'alpha' : alp,
    'beta' : bet,
    'T' : 0
}
fit = sm.sampling(data=coin_data, iter=2*K, 
                  warmup=K, chains=1)
la = fit.extract(permuted=True)

## in stead of length K, 
## now use a much shorter chain (of length L)
L = 10 
chains = 4 
Us = np.zeros(shape=(N+1,L*chains))
Ts = np.linspace(0, 1, N+1)

## now run the 'chain of chains'
for i, Ti in enumerate(Ts):
    coin_data['T'] = Ti ## take another T
    ## take some thetas from the previous sample
    thetas = np.random.choice(la["theta"], chains)
    initial_guesses = [{'theta' : theta} for theta in thetas]
    fit = sm.sampling(data=coin_data, iter=2*L, warmup=L, 
                      chains=chains, init=initial_guesses)
    la = fit.extract(permuted=True)
    Us[i,:] = la['U']
Ok, let us have another look at the result. For this, I used the same code as above:
estimated lambda = -5.277354 +/- 1.120274
estimated p(D|M_1) = 0.005106
exact lambda = -4.615121
exact p(D|M_1) = 0.009901
As you can see, the estimate is less precise, but this is due to the fact that N=200 instead of 1000.
I've written in, and I ran the above code fragments from a Jupyter notebook. As compiling and sampling can take a lot of time, such an interface can be very convenient. Please let me know when this was somehow useful for you or if you have any questions, and also please tell me if I did something stupid...

Monday, 2 May 2016

A Sunset Colormap for Python

Room Z533 in the Kruyt building in Utrecht can have a wonderful view, especially at sunset. Not too long ago, I took the following photograph.

Having a love-hate relationship with colormaps, and in particular choosing a good one, I INSTANTLY noticed the beauty of the gradient of the sky, and hence, its application as a colormap.

In Python, it is easy to import figures as a matrix, and then extract the RGB values along a line. To make things easier, I first cropped out the part of the sky I wanted:
Then, I had a look at this website.
Let's import the "slice of sky" in Python and have a look:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LinearSegmentedColormap
from PIL import Image

im = Image.open("/path/to/sunset.jpg")
pix = im.load()

xs = range(im.size[1])
y = im.size[0]/2
rs = [pix[y,x][0] for x in xs]
gs = [pix[y,x][1] for x in xs]
bs = [pix[y,x][2] for x in xs]

fig = plt.figure(figsize=(5,2))
ax = fig.add_subplot(111)

ax.plot(xs, rs, color='red')
ax.plot(xs, gs, color='green')
ax.plot(xs, bs, color='blue')
ax.set_xlabel("pixel")
ax.set_ylabel('r/g/b value')

plt.savefig('rgb-plot.png', dpi=300, bboxinches='tight')

This results in the following figure:

Now let's make the actual colormap.
Dp = 1 ## determines the number of pixels between "nodes"
xs = np.linspace(0,1,im.size[1]/Dp + 1) ## points between 0 and 1 
idxs = range(0,im.size[1],Dp) ## indices in the original picture matrix

redlist = [rs[idx]/255. for idx in idxs] + [rs[-1]/255., 0]
greenlist = [gs[idx]/255. for idx in idxs] + [gs[-1]/255., 0]
bluelist = [bs[idx]/255. for idx in idxs] + [bs[-1]/255., 0]

## LinearSegmentedColormap wants these weird triples, 
## where some end points are ignored...

redtuples = [(x, redlist[i], redlist[i+1]) for i, x in enumerate(xs)]
greentuples = [(x, greenlist[i], greenlist[i+1]) for i, x in enumerate(xs)]
bluetuples = [(x, bluelist[i], bluelist[i+1]) for i, x in enumerate(xs)]

cdict = {'red' : redtuples, 'green' : greentuples, 'blue' : bluetuples}

cmap = LinearSegmentedColormap('tbz533', cdict) ## choose a name
plt.register_cmap(cmap=cmap)
## now we can use our new colormap!

OK. Let's make some 70s wallpaper to celebrate the new colormap:
xs = np.linspace(-5,5,1000)
ys = np.linspace(-5,5,1000)
zs = np.array([[np.sin(x)*np.cos(y) for x in xs] for y in ys])

fig = plt.figure(figsize=(5.5,5))
ax1 = fig.add_subplot(1,1,1)
C = ax1.pcolormesh(xs, ys, zs, cmap='tbz533')
fig.colorbar(C)

fig.savefig('my70swallpaper.png', dpi=200, bboxinches='tight')

This results in the following figure:

That's it! Please leave your own favorite colormap in the comments.