Saturday, 18 May 2019

A thread-safe global random number generator using C++ <random> and <mutex>

In a previous post, Sandro showed how to use the C++ <random> header to define a global random number generator (RNG). If the user later decides to parallelize their program, the global RNG has to be made thread-safe. Fortunately, this requires only a small modification.

I have copied Sandro's code and added thread safety using the C++11 headers <thread> and <mutex>. In the header file thread_safe_random.hpp, the functions RANDOM and Seed are declared.
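The original file is embedded in the post; a minimal sketch of what it could look like (the include guard name is mine):
// thread_safe_random.hpp -- a sketch of the declarations described above
#ifndef THREAD_SAFE_RANDOM_HPP
#define THREAD_SAFE_RANDOM_HPP

// seed the engine; a negative seed means "randomize" (cf. the previous post)
int Seed(int seed);

// return a uniform random number from [0,1); safe to call from any thread
double RANDOM();

#endif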

The C++ file thread_safe_random.cpp defines the functions RANDOM and Seed, and the RNG and uniform distribution as before. Additionally, a mutex my_rng_mutex is defined to guard my_rng, and I use a lock_guard to lock and release it. When one thread of execution calls the function RANDOM, it acquires the mutex; any other thread that calls RANDOM has to wait until the mutex is released.
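Again a sketch, not the original file; here the engine, the distribution and the mutex get internal linkage, so that all access goes through Seed and RANDOM:
// thread_safe_random.cpp -- a sketch of the definitions described above
#include <random>
#include <mutex>
#include <chrono>
#include "thread_safe_random.hpp"

static std::mt19937_64 my_rng{};
static std::uniform_real_distribution<double> my_unif_real_dist(0.0, 1.0);
static std::mutex my_rng_mutex; // guards my_rng

// a negative seed means: use the clock to randomize (as in Sandro's code)
int Seed(int seed) {
    std::lock_guard<std::mutex> lock(my_rng_mutex);
    if ( seed < 0 ) {
        seed = static_cast<int>(
            std::chrono::high_resolution_clock::now().time_since_epoch().count());
    }
    my_rng.seed(seed);
    return seed;
}

double RANDOM() {
    // the lock_guard acquires the mutex here and releases it
    // automatically when it goes out of scope
    std::lock_guard<std::mutex> lock(my_rng_mutex);
    return my_unif_real_dist(my_rng);
}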

In order to demonstrate thread_safe_random, I created max threads in the main function that use the auxiliary function fetch_random_number to call RANDOM.
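A sketch of such a main function (the exact signature of fetch_random_number is my assumption):
// main.cpp -- a sketch: max threads each fetch one random number
#include <iostream>
#include <vector>
#include <thread>
#include <functional>
#include "thread_safe_random.hpp"

// auxiliary function executed by each thread
void fetch_random_number(double& out) {
    out = RANDOM();
}

int main() {
    const int max = 10;
    Seed(235); // hypothetical seed; any value works

    std::vector<std::thread> threads;
    std::vector<double> numbers(max);
    for ( int i = 0; i < max; ++i ) {
        threads.emplace_back(fetch_random_number, std::ref(numbers[i]));
    }
    for ( auto& t : threads ) {
        t.join(); // wait for all threads to finish
    }
    for ( double x : numbers ) {
        std::cout << x << std::endl;
    }
}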

The result should look like
$ g++ --std=c++11 -pthread main.cpp thread_safe_random.cpp -o test_safe_random
$ ./test_safe_random
0.284779
0.243487
0.161906
0.338338
0.235765
0.502853
0.389262
0.165401
0.244871
0.194046
However, the order of these numbers can change each time you execute the program. This means that the program is no longer deterministic (although we can argue about what exactly it means to be deterministic), because the OS determines the order in which the threads call RANDOM. Another problem with this implementation is that it will be slow when each thread needs many random numbers, since the threads constantly have to wait for one another to release the mutex.
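One way to avoid this lock contention (not discussed in this post) is to give each thread its own engine using thread_local; a minimal sketch:
// alternative sketch: a per-thread engine needs no mutex at all,
// but every thread's engine has to be seeded separately
#include <random>

double RANDOM_unlocked() {
    thread_local std::mt19937_64 engine{std::random_device{}()};
    thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(engine);
}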

Friday, 17 May 2019

Copying polymorphic C++ objects using an inherited dup() method

In order to copy polymorphic objects in C++, it can be convenient to equip the base and derived classes with a .dup() method (or .clone() method) that returns a pointer to a copy of the (derived) object. When you have a large number of different derived classes, overriding the base class's .dup() method for each of them can be a bit of a nuisance. To solve this, I sometimes use an "intermediate" class template that can be inherited instead of the base class to provide the .dup() method. This solution is not perfect, because it does not provide the possibility of covariant return types.

The class template Cloneable is defined as follows:
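The original snippet is embedded in the post; here is a minimal sketch of what such a template could look like, assuming the base class declares a method virtual Base* dup() const:
// a sketch of the Cloneable class template
template <class Base, class Derived>
class Cloneable : public Base {
public:
    using Base::Base; // make Base's constructors available
    Base* dup() const override {
        // copy *this as the most derived type
        return new Derived(static_cast<const Derived&>(*this));
    }
};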

In the following code snippet, the use of Cloneable is demonstrated:
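Again a sketch rather than the original snippet (the example classes are mine):
// demonstration of Cloneable
#include <iostream>
#include <memory>

class Animal {
public:
    virtual ~Animal() = default;
    virtual Animal* dup() const = 0; // the polymorphic copy method
    virtual void speak() const = 0;
};

// Dog and Cat inherit Animal via Cloneable, which provides dup()
class Dog : public Cloneable<Animal, Dog> {
public:
    void speak() const override { std::cout << "Woof!" << std::endl; }
};

class Cat : public Cloneable<Animal, Cat> {
public:
    void speak() const override { std::cout << "Meow!" << std::endl; }
};

int main() {
    std::unique_ptr<Animal> a(new Dog);
    std::unique_ptr<Animal> b(a->dup()); // copies the Dog, not just an Animal
    b->speak(); // prints "Woof!"
}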
If someone has a better way to do this, let me know.

Thursday, 1 November 2018

The simplest example of c++11 <random>

For some reason, the internet lacks a beginner-level tutorial on how to use the c++11 <random> library when your code is spread over different files.
This is not a tutorial; this is just what I managed to put together...

I want
  • one global random engine (e.g. the mersenne twister)
  • one real-valued uniform distribution on the interval [0,1); let's call this function RANDOM()
  • to be able to call RANDOM() from anywhere in my code.

There are three files: random.h, random.cpp and main.cpp (any additional .cpp file that includes random.h can use the function RANDOM()).
The content of the files is as follows:
 
random.h
// random.h
#ifndef RND_HH_ // note: names like _RND_HH_ are reserved for the implementation
#define RND_HH_

#include <random> //--- FOR THIS YOU NEED c++11, enable with -std=c++11 flag

// Declare engine - single instance for the whole code
//extern std::mt19937 my_rng;
extern std::mt19937_64 my_rng;

//Declare distributions:
extern std::uniform_real_distribution<double> my_unif_real_dist;
//extern std::uniform_int_distribution<int> my_unif_int_dist; // int distributions need an integer type

int Seed(int seed);
double RANDOM();

#endif 

// end of random.h 
random.cpp
//random.cpp
#include <stdio.h>
#include <iostream>
#include <chrono>
#include "random.h"

//std::mt19937 my_rng {}; 
std::mt19937_64 my_rng {}; // Defines an engine
std::uniform_real_distribution<double> my_unif_real_dist(0., 1.); //Define distribution
// Function to seed the random number generator from main file
// useful if you want the seed from a parameter file
// a negative value for seed gets you a random seed
// outputs the seed itself
int Seed(int seed)
{
  if (seed < 0) {
    long rseed = static_cast<long>(std::chrono::high_resolution_clock::now().time_since_epoch().count());
    std::cerr << "Randomizing random generator, seed is "<<rseed<<std::endl;
    my_rng.seed(rseed);
    return rseed;
  } else {
    std::cerr << "User-provided seed is "<<seed<<std::endl;
    my_rng.seed(seed);
    return seed;
  }
}
// This is the function to call if you want a random number in the interval [0,1)
double RANDOM(void)
{
  return my_unif_real_dist(my_rng);
}
// end of random.cpp
And finally, the main.cpp file:
//main.cpp
#include <stdio.h>
#include <iostream>
#include "random.h"

int main()
{
  int max = 10;
  int my_seed = 235;
  
  int my_new_Seed = Seed(my_seed); // Seed returns the seed that was actually used
  
  for(int i=0; i<max;++i){
    double one_random_number = RANDOM();
    std::cerr << one_random_number << std::endl;
  }
}// end of main.cpp
That's it! This is really all you need.
Compile it with:
g++ -std=c++11 random.cpp main.cpp -o my_pretty_random_numbers
and happy random number generation.

By the way, this seems to work fine on my machine (running Ubuntu 18).
Can this be improved in simplicity and/or performance? Are there bugs?
Please let me know!

Sunday, 3 June 2018

Easy (Bayesian) multidimensional scaling with Stan

Multidimensional scaling (MDS) is a data visualization technique in which the dimensionality of the data is reduced in a non-linear way. The data is represented as an \(N\times N\) distance matrix \((d_{ij})_{ij}\), and \(N\) points \(x_i\) in a \(D\)-dimensional space (typically \(D=2\)) are chosen such that the Euclidean distances \(\|x_i - x_j\|\) resemble the input distances \(d_{ij}\) as well as possible.

In metric MDS, an objective function \(E(x) = \sum_{1\leq i < j \leq N} (d_{ij} - \|x_i - x_j\|)^2\) is defined that needs to be minimized; other flavors of MDS define the objective function differently. To minimize the objective function, one can use, e.g., the conjugate gradient method, which requires the gradient \(\nabla E\) of the objective function. Of course, this is not so difficult in the case of metric MDS, but more complicated objective functions might require more effort. Enter Stan.

Stan uses automatic differentiation for Hamiltonian Monte Carlo, but Stan can also be used for maximum likelihood estimation. Hence, if we can formulate the MDS problem in terms of a likelihood function, we can let Stan do all the work. The parameters of the model are the \(x_i\); the data is given by the distances \(d_{ij}\). If we assume that, given the parameters, the data is distributed as \[ d_{ij} \sim \mathcal{N}(\|x_i - x_j\|, \sigma^2)\,, \] then maximizing the (log-)likelihood is equivalent to minimizing the function \(E\). The parameter \(\sigma^2\) is merely a nuisance parameter that needs to be estimated as well.

An implementation of MDS in the Stan programming language

Implementing MDS in Stan is fairly straightforward, but there are a few snags that we should be aware of. First, if \(x\) solves the MDS problem, then so does any Euclidean transformation of \(x\). Hence, the model as stated above has too many parameters. We solve this by fixing the first point at the origin, restricting the second point to a \(1\)-dimensional half-space, the third point to a \(2\)-dimensional half-space, et cetera. The last \(N-D-1\) points are unrestricted. In Stan, we can accomplish this by using a cholesky_factor_cov matrix: a lower-triangular matrix with a positive diagonal. We then use the transformed parameters block to concatenate the points into a single matrix.

Secondly, the data that we use in the example below is highly censored. Many of the distances are missing, and some are right-censored. In such a case, MDS can be used to infer the missing distances, and not merely to visualize the data. The data that is passed to Stan, therefore, is a list of edges, a list of distances, and a list of codes that determine the type of censoring.

Thirdly, as the title of this post suggests, we will use Stan to do some sort of Bayesian MDS. In this case, we will sample a collection of "maps" \(x\) from the posterior distribution, which gives information about the location of each point, but also about the uncertainty of this location. Here, the fact that we restrict the first \(D+1\) points comes back to bite us, as the uncertainty of these points will be different from that of the unrestricted points. Furthermore, it might be hard to compare the individual maps to one another, and for instance compute sensible mean locations of the points, as some maps may be "twisted" more than others. Therefore, we use the generated quantities block to center and rotate (cf. PCA) the sampled maps.
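The complete model (mds_model.stan) is linked below; as a rough illustration, the core of such a model, without the censoring and without the centering/rotation in the generated quantities block (all names here are mine), could look like:
// mds_sketch.stan -- a minimal sketch, not the full mds_model.stan
data {
    int<lower=1> D;              // dimension of the embedding (e.g. 2)
    int<lower=D+1> N;            // number of points
    int<lower=0> E;              // number of observed distances (edges)
    int<lower=1, upper=N> ii[E]; // first end point of each edge
    int<lower=1, upper=N> jj[E]; // second end point of each edge
    vector<lower=0>[E] d;        // the observed distances
}
parameters {
    // row k of X1 is point k+1, restricted to a k-dimensional half-space:
    // lower-triangular with a positive diagonal
    cholesky_factor_cov[D] X1;
    matrix[N-D-1, D] X2;         // the remaining, unrestricted points
    real<lower=0> sigma;         // nuisance parameter (flat prior here)
}
transformed parameters {
    // point 1 is fixed at the origin; concatenate all points
    matrix[N, D] X = append_row(rep_row_vector(0.0, D), append_row(X1, X2));
}
model {
    for ( e in 1:E ) {
        d[e] ~ normal(distance(X[ii[e]], X[jj[e]]), sigma);
    }
}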



Example: Antigenic cartography of the influenza virus

An interesting application of MDS is antigenic cartography of the influenza virus. Influenza virus circumvents human antibody responses by continually evolving its surface proteins, in particular, hemagglutinin (HA). This is known as antigenic drift. In order to decide whether flu vaccines need to be updated, the hemagglutination inhibition (HI) assay is used to determine if the induced antibody response is still effective against future strains. The titers measured in the HI assay can be used to define "distances" between antisera and antigens. Using MDS, the antisera and antigens can be drawn into a "map" that shows the antigenic drift of the virus. This was done by Smith et al. in 2004. Conveniently, the data used for their map is available online. This table gives the HI titer \(H_{ij}\) of antigen \(i\) and antiserum \(j\). A small titer corresponds to a large distance; the distances are defined as \(d_{ij} = \log_2(H_{\max,j}) - \log_2(H_{ij})\), where \(H_{\max,j} = \max_{k} H_{kj}\). As an example, I recreated their antigenic map using the Stan model above and a Python script below.
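As an illustration of the titer-to-distance transformation (my sketch; the actual parsing script is added below):
## sketch: compute d_ij = log2(H_max,j) - log2(H_ij) from a titer table
import numpy as np

## H[i,j] = HI titer of antigen i and antiserum j, np.nan where missing
H = np.array([[1280.,  640.,   80.],
              [ 320., 1280.,  160.],
              [  40.,  160., 1280.]])

log_titers = np.log2(H)
## the maximum is taken per antiserum (column) j
d = np.nanmax(log_titers, axis=0) - log_titers
print(d)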


The white squares denote the relative positions of the antisera in the "antigenic space", while the colored circles represent the antigens. The colors map to the years in which the influenza strains were isolated.

Bayesian multidimensional scaling

For antigenic cartography of IAV, Bayesian MDS has been introduced by Bedford et al., who used multiple HI assay results per antigen/antiserum pair to incorporate the uncertainty of these measurements into their antigenic map. Moreover, they were able to use genetic and temporal information about the antigens (i.e. the RNA sequences of HA and their isolation dates) to inform the positions of the antigens and antisera on the map. We will not go that far in this post, but since we have already formulated the MDS algorithm in Stan, we might as well make a "Bayesian" antigenic map. This can give some insight into the uncertainty of the positions of the antigens and antisera, not unlike the confidence areas drawn by Smith et al. (the ellipsoid shapes). The result is given by the following figure.


Again, squares indicate antisera and colored circles the antigens. All the individual MCMC samples are represented by the grey dots. The MCMC samples for each antigen or antiserum are used to draw a two-dimensional error bar (i.e. ellipse) around the mean location.
A Python script for parsing the HI titer data, compiling and running the Stan model, and drawing the maps is added below. For it to work, you will need to download the mds_model.stan file and make a csv file called baselinemap.csv containing the HI table.

Tuesday, 6 February 2018

Computing q-values with C++

When looking for associations between features \(i = 1,\dots, m\) and some trait, it is often necessary to have some sort of multiple-testing correction. A very conservative method is the Bonferroni correction, which controls the family-wise error rate (FWER), but at the cost of many false negatives. This is not desirable when one wants to discover features or associations, and therefore other methods have been developed. One particularly intuitive method is based on the false discovery rate (FDR) and uses so-called \(q\)-values, which are (under certain conditions) elegantly analogous to \(p\)-values.

False discovery rate

Let \(S\) be the number of features called significant, and \(F\) the number of false positives among the significant features (i.e. false discoveries). In an article by John D. Storey, the (positive) false discovery rate is defined as \[ {\rm pFDR} := \mathbb{E}\left[\left.\frac{F}{S}\right| S > 0 \right]\,. \] Hence, the pFDR is the expected fraction of false positives among the features that are called significant. The condition \(S > 0\) ensures that it is well defined.

In the case of hypothesis testing, one typically has a test statistic \(T\), and one wants to test if the null hypothesis is true (\(H = 0\)), or rather that the alternative hypothesis is true (\(H = 1\)). The statistical model specifies the distribution of \(T | H = 0\), and the null hypothesis is rejected when the realization of \(T\) falls into a pre-defined significance region \(\Gamma\).
When testing multiple features, we typically have a sequence \((T_i, H_i)_{i=1}^m\), here assumed to be identically distributed and independent. The \({\rm pFDR}\) then depends on \(\Gamma\): \[ {\rm pFDR}(\Gamma) = \mathbb{E}\left[\left.\frac{F(\Gamma)}{S(\Gamma)}\right| S(\Gamma) > 0 \right] \,, \] where \(F(\Gamma) := \#\{i : T_i \in \Gamma \wedge H_i = 0 \} \) and \(S(\Gamma) := \#\{i : T_i \in \Gamma\}\). Storey derives that under certain conditions, we can write \[ {\rm pFDR}(\Gamma) = \mathbb{P}[H = 0 | T \in \Gamma] = \frac{\mathbb{E}[F(\Gamma)]}{\mathbb{E}[S(\Gamma)]} \]

The q-value

Let \(\{\Gamma_{\alpha}\}_{\alpha=0}^1\) be a nested family of significance regions. That is, \(\Gamma_{\alpha} \subseteq \Gamma_{\alpha'}\) whenever \(\alpha \leq \alpha'\), and \(\mathbb{P}[T \in \Gamma_{\alpha} | H=0] = \alpha\). For instance, if \(T | H = 0 \sim \mathcal{N}(0,1)\), then we could choose \(\Gamma_{\alpha} = [z_{\alpha}, \infty)\), where \(z_{\alpha} = \Phi^{-1}(1-\alpha)\), with \(\Phi\) the CDF of \(\mathcal{N}(0,1)\).
The \(q\)-value of a realization \(t\) of \(T\) is then defined as \[ q(t) = \inf_{\Gamma_{\alpha} | t \in \Gamma_{\alpha}} {\rm pFDR}(\Gamma_{\alpha}) \] We can now give the above-mentioned analogy between \(p\)-values and \(q\)-values. The \(p\)-value is defined as: \[ p(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} \mathbb{P}[T \in \Gamma_{\alpha} | H = 0] \] While under the right conditions, the \(q\)-value can be written as: \[ q(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} \mathbb{P}[H = 0 | T \in \Gamma_{\alpha} ] \]

Computing q-values

In order to compute \(q\)-values, given a sequence of \(p\)-values, we follow the steps given in a paper by Storey and Tibshirani. In this scenario, the \(p\)-value plays the role of the realization \(t\) of the statistic \(T\). Under the null hypothesis, these \(p\)-values are uniformly distributed. As a family of significance regions, we simply take \(\Gamma_{\alpha} = [0,\alpha]\), and write for instance \(S(\alpha) := S(\Gamma_{\alpha})\).

First, we have to estimate \(\mathbb{E}[S(\alpha)]\), for which we use \(\#\{i : p_i \leq \alpha\}\), and we estimate \(\mathbb{E}[F(\alpha)]\) with \(m \cdot \hat{\pi}_0 \cdot \alpha\), where \(\hat{\pi}_0\) is an estimate for \(\pi_0 = \mathbb{P}[H=0]\).
The most difficult part of computing \(q\)-values is estimating \(\pi_0\). An often-used estimate is based on the average \(p\)-value, but we can do a little bit better by making use of the empirical CDF of the \(p\)-values. The figure below shows a "typical" histogram of \(p\)-values, where the \(p\)-values sampled from the alternative distribution are given by the gray bars. The marginal distribution of \(p\)-values becomes relatively flat towards the right, where most \(p\)-values should come from the null distribution, which is \({\rm Uniform}(0,1)\). The right panel of the figure shows a "rotated" empirical CDF of the \(p\)-values, i.e. \[ x \mapsto R(x) = 1-{\rm CDF}(1-x)\,, \] and the fact that large \(p\)-values should be uniformly distributed is reflected by the fact that \(R(x)\) is a straight line for small \(x\). The slope of this straight line is an estimate of \(\pi_0\). In the C++ code below, I use GSL to fit a line through \(R\) (indicated by the red line in the figure), using the weight function \(x \mapsto (1-x)^2\) to give more weight to small values of \(x\), thereby mostly ignoring the non-straight part of \(R\).

In this example, the marginal \(p\)-values are sampled from a mixture distribution \(\pi_0 {\rm Uniform}(0,1) + (1-\pi_0) {\rm Beta}(10,1)\), where \(\pi_0 = 3/4\).


Now we sort the \(p\)-values such that \(p_1 \leq p_2 \leq \dots \leq p_m\) (so that \(\#\{j : p_j \leq p_i\} = i\)), and first determine the \(q\)-value corresponding to \(p_m\): \[ q_m = \hat{\pi}_0 \cdot p_m \] The \(q\)-value \(q_i\) corresponding to \(p\)-value \(p_i\) is then computed as \[ q_i = \min(\hat{\pi}_0 p_i m/i, q_{i+1}) \] Recently, I had to implement this algorithm in C++. The function is given in the following header file; it accepts a "keyed" list of \(p\)-values, where the key is used to identify the feature, and returns a keyed list of \(q\)-values.
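The original header is attached below; here is a minimal sketch of the algorithm (my reconstruction: instead of GSL, the weighted through-origin fit of \(R\) is computed in closed form):
// qvalues_sketch.hpp -- a sketch of the q-value computation described above;
// not the original header
#ifndef QVALUES_SKETCH_HPP
#define QVALUES_SKETCH_HPP

#include <vector>
#include <map>
#include <algorithm>
#include <utility>

/* Estimate pi0 = P[H=0] by fitting a straight line through the origin to the
 * rotated empirical CDF R(x) = 1 - CDF(1-x), with weight function (1-x)^2.
 */
inline double estimate_pi0(std::vector<double> ps) {
    std::sort(ps.begin(), ps.end());
    const std::size_t m = ps.size();
    double swxy = 0.0, swxx = 0.0;
    for ( std::size_t i = 0; i < m; ++i ) {
        double x = 1.0 - ps[i];                    // R is evaluated at 1 - p_(i)
        double y = 1.0 - double(i+1) / double(m);  // R(x) = 1 - CDF(p_(i))
        double w = (1.0-x) * (1.0-x);              // weight (1-x)^2
        swxy += w * x * y;
        swxx += w * x * x;
    }
    double pi0 = ( swxx > 0.0 ? swxy / swxx : 1.0 );
    return std::min(1.0, std::max(0.0, pi0));      // clamp to [0,1]
}

/* Compute q-values for a keyed list of p-values. */
template <typename Key>
std::map<Key, double> compute_qvalues(const std::map<Key, double>& pvalues) {
    std::map<Key, double> qvalues;
    const std::size_t m = pvalues.size();
    if ( m == 0 ) return qvalues;

    // sort the (p-value, key) pairs by p-value
    std::vector<std::pair<double, Key>> sorted;
    std::vector<double> ps;
    for ( const auto& kv : pvalues ) {
        sorted.emplace_back(kv.second, kv.first);
        ps.push_back(kv.second);
    }
    std::sort(sorted.begin(), sorted.end());

    const double pi0 = estimate_pi0(ps);

    // walk from the largest p-value down:
    // q_m = pi0 * p_m, and q_i = min(pi0 * p_i * m / i, q_{i+1})
    double qnext = pi0 * sorted[m-1].first;
    qvalues[sorted[m-1].second] = qnext;
    for ( std::size_t i = m-1; i-- > 0; ) {
        double qi = std::min(pi0 * sorted[i].first * double(m) / double(i+1), qnext);
        qvalues[sorted[i].second] = qi;
        qnext = qi;
    }
    return qvalues;
}

#endif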



The following code gives an example of how to use qvalues.hpp, and can be used for testing:
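A sketch of what such a test program could look like (again my reconstruction; the output shown below comes from the original program, so the exact numbers will differ):
// qvalues_sketch_test.cpp -- generate p-values from a null/alternative
// mixture, compute q-values, and count the discoveries
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <map>
#include <vector>
#include <random>
#include "qvalues_sketch.hpp"

int main(int argc, char** argv) {
    unsigned seed = (argc > 1) ? std::atoi(argv[1]) : 1728;
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    const int m = 1000;          // number of features
    const double pi0 = 0.75;     // fraction of true null hypotheses
    std::map<int, double> pvals; // p-values, keyed by feature index
    std::vector<bool> null_feature(m);
    for ( int i = 0; i < m; ++i ) {
        null_feature[i] = (unif(rng) < pi0);
        double u = unif(rng);
        // null: Uniform(0,1); alternative: concentrated near 0 (here u^10)
        pvals[i] = null_feature[i] ? u : std::pow(u, 10.0);
    }

    std::map<int, double> qvals = compute_qvalues(pvals);

    // call a feature significant when its q-value is below 0.2
    int tp = 0, fp = 0, tn = 0, fn = 0;
    for ( int i = 0; i < m; ++i ) {
        bool significant = (qvals[i] < 0.2);
        if ( significant ) { if ( null_feature[i] ) ++fp; else ++tp; }
        else               { if ( null_feature[i] ) ++tn; else ++fn; }
    }
    std::printf("              true false\n");
    std::printf("  discoveries: %4d  %4d\n", tp, fp);
    std::printf("    negatives: %4d  %4d\n", tn, fn);
    if ( tp + fp > 0 ) {
        std::printf("realized FDR: %g\n", double(fp) / double(tp + fp));
    }
    return 0;
}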


After compiling and running the program, the result should be something like:
$ g++ -O3 -std=c++11 qvalues_test.cpp -lgsl -lgslcblas -o qvaltest
$ ./qvaltest 1728
                true false
  discoveries:   143    32
    negatives:   712   113
realized FDR: 0.182857

Tuesday, 18 April 2017

How to calculate WBIC with Stan

In a previous post, I showed how to use Stan to calculate the Bayes marginal likelihood with the path sampling method. For large models, this method can be computationally expensive, and therefore cheaper estimates of the marginal likelihood have been developed. The most famous one is the Bayesian Information Criterion (BIC), an estimate of the Bayes free energy. Although the name suggests otherwise, the BIC is not a "true Bayesian" estimate, since it depends only on a point estimate, and not on the entire posterior distribution: \[ {\sf BIC} = -2 L(\hat{\theta}) + k \cdot \log(n)\,, \] where \(L\) is the log-likelihood function, \(\hat{\theta}\) is the maximum likelihood estimate of the model's parameters, \(k\) is the dimension of \(\theta\), and \(n\) is the number of data points.
More importantly, the BIC does not work for singular models. As Watanabe points out, if a statistical model contains hierarchical layers or hidden variables then it is singular. Watanabe, therefore, proposes a generalization of the BIC, the WBIC, that works for singular models as well, and is truly Bayesian. This is not unlike AIC versus WAIC.

Suppose that you have some sort of MCMC algorithm to sample from the posterior distribution of your model, given your data. When you want to compute the WAIC, you will just have to store the likelihoods of each observation, given each sampled set of parameters. This can be done during an actual run of the MCMC algorithm. For the WBIC, however, you will need to do a separate run at a different "temperature". In fact, Watanabe gives the following formula: \[ {\sf WBIC} = -2\mathbb{E}_{\theta}^{\beta}[L(\theta)]\,, \] where \(\beta = 1/\log(n)\), and the notation \(\mathbb{E}_{\theta}^{\beta}[f(\theta)]\) expresses that the expectation of \(f\) is taken with respect to the distribution with PDF \[ \theta \mapsto \frac{\exp(\beta L(\theta)) \pi(\theta)}{\int\exp(\beta L(\theta)) \pi(\theta)d\theta} \] which equals the posterior distribution when \(\beta = 1\) and the prior distribution \(\pi\) when \(\beta = 0\).

Something very similar happens in the path sampling example, with the exception that the "path" is replaced by a single point. In a recent paper about another alternative to the BIC, the Singular BIC (sBIC), Drton and Plummer point out the similarity to the mean value theorem for integrals. Using the path sampling method, the log-marginal-likelihood is computed from the following integral \[ \log(P(D|M)) = \int_0^1 \mathbb{E}_{\theta}^{\beta}[L(\theta)]\, d\beta\,. \] Here, we write \(P(D|M)\) for the marginal likelihood of model \(M\), and data \(D\). According to the mean value theorem \[ \exists \beta^{\ast} \in (0,1): \log(P(D|M)) = \mathbb{E}_{\theta}^{\beta^{\ast}}[L(\theta)]\,, \] and Watanabe argues that choosing \(1/\log(n)\) for \(\beta^{\ast}\) gives a good approximation. Notice that the mean value theorem does not provide a recipe to compute \(\beta^{\ast}\), and that Watanabe uses rather advanced algebraic geometry to prove that his choice for \(\beta^{\ast}\) is good; the mean value theorem only gives a minimal justification for Watanabe's result.

Implementing WBIC in Stan

Let us start with a very simple example: the biased coin model. Suppose that we have \(N\) coin flips, with outcomes \(x \in \{0,1\}^N\) (heads = 1 and tails = 0), and let \(K\) denote the number of heads. We compare the null model (the coin is fair) with the alternative model (the coin is biased) using \(\Delta {\sf WBIC}\) and see if we get the "right" answer. Notice that for this simple example, we can compute the Bayes factor exactly.

In order to have a Stan model that can be used for both regular sampling, and estimating WBIC, a sampling mode is passed with the data to determine the desired behavior. In the transformed data block, a parameter watanabe_beta is then defined, determining the sampling "temperature".
The actual model is defined in the transformed parameters block, so that log_likes can be used in both the model block and the generated quantities block. Instead of a sampling statement (x[n] ~ bernoulli(p)), we have to use the bernoulli_lpmf function, which allows us to scale the log-likelihood with watanabe_beta (i.e. target += watanabe_beta * sum(log_likes)).
// coin_model.stan

data {
    int<lower=3> N; // number of coin tosses
    int<lower=0, upper=1> x[N]; // outcomes (heads = 1, tails = 0)
    int<lower=0, upper=1> mode; // 0 = normal sampling, 1 = WBIC sampling
}
transformed data {
    real<lower=0, upper=1> watanabe_beta; // WBIC parameter
    if ( mode == 0 ) {
        watanabe_beta = 1.0;
    }
    else { // mode == 1
        watanabe_beta = 1.0/log(N);
    }
}
parameters {
    real<lower=0, upper=1> p; // probability of heads
}
transformed parameters {
    vector[N] log_likes;
    for ( n in 1:N ) {
        log_likes[n] = bernoulli_lpmf(x[n] | p);
    }
}
model {
    p ~ beta(1, 1);
    target += watanabe_beta * sum(log_likes);
}
generated quantities {
    real log_like;
    log_like = sum(log_likes);
}
Using the pystan module for Python (3), one could estimate \(p\) from data \(x\) as follows:
import pystan
import numpy as np
import scipy.stats as sts

## compile the model
sm = pystan.StanModel(file="coin_model.stan")

## create random data
N = 50
p_real = 0.75
x = sts.bernoulli.rvs(p_real, size=N)

## prepare data for Stan
data = {'N' : N, 'x' : x, 'mode' : 0}
pars = ['p', 'log_like']

## run the Stan model
fit = sm.sampling(data=data)
chain = fit.extract(permuted=True, pars=pars)
ps = chain['p']
print("p =", np.mean(ps), "+/-", np.std(ps))
which should give output similar to
p = 0.67187824104 +/- 0.0650602081466
In order to calculate the WBIC, the mode has to be set to 1:
## import modules, compile the Stan model, and create data as before...

## prepare data for Stan
data = {'N' : N, 'x' : x, 'mode' : 1} ## notice the mode
pars = ['p', 'log_like']

## run the Stan model
fit = sm.sampling(data=data)
chain = fit.extract(permuted=True, pars=pars)
WBIC = -2*np.mean(chain["log_like"])

print("WBIC =", WBIC)
which should result in something similar to
WBIC = 66.036854275
In order to test if this result is any good, we first compute the WBIC of the null model. This is easy since the null model does not have any parameters \[ {\sf WBIC} = -2\cdot L(x) = -2 \cdot \log(\tfrac12^{K}(1-\tfrac12)^{N-K}) = 2N\cdot\log(2) \] and hence, when \(N=50\), we get \({\sf WBIC} \approx 69.3\). Hence, we find positive evidence against the null model, since \(\Delta{\sf WBIC} \approx 3.3\).

For the coin-toss model, we know the marginal likelihood explicitly, namely \[ P(x|M) = \int_0^1 p^K (1-p)^{N-K} dp = B(K+1, N-K+1)\,, \] where \(M\) denotes the alternative (i.e. biased) model, and \(B\) denotes the Beta function. In the above example, we had \(N = 50\) and \(K = 34\), and hence \(-2\cdot\log(P(x|M)) \approx 66.31\), which is remarkably close to the \({\sf WBIC}\) of the alternative model.
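As a quick check (not in the original post), this exact value can be computed with SciPy:
## exact -2 log marginal likelihood of the biased-coin model
from scipy.special import betaln

N, K = 50, 34
print("-2 log P(x|M) =", -2 * betaln(K + 1, N - K + 1))  ## approx. 66.31
Similar results hold for several values of \(p\), as demonstrated by the following figure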

A non-trivial example

As the example above is non-singular, and WBIC is supposed to work also for singular models, I plan to present an example with a singular model here.

Tuesday, 22 November 2016

Using the "ones trick" to handle unbalanced missing data with JAGS

The so-called "ones trick" for JAGS and BUGS models allows the user to sample from distributions that are not in the standard list. Here I show another application for "unbalanced" missing data. More examples using the ones (or zeros) trick can be found in The BUGS Book, and in particular in Chapter 9.

Suppose that we want to find an association between a trait \(x\) (possibly continuous) and another, categorical trait \(k \in \{1, 2,\dots, K\}\). We have \(N\) observations \(x_i\); however, some of the \(k_i\) are missing. Whenever there is no information on the missing \(k_i\) whatsoever, we can simply write the following JAGS model:
data {
   /* alpha is the parameter for the Dirichlet prior of p, 
    * with p the estimated frequency of k
    */
   for ( j in 1:K ) {
      alpha[j] <- 1
   }
}
model {
   for ( i in 1:N ) {
      /* Some model for x, given k */
      x[i] ~ dnorm(xhat[k[i]], tau_x)
      /* Sample missing values of k, and inform
       * the distribution p with the known k's
       */
      k[i] ~ dcat(p)
   }
   /* Model the distribution p_k of the trait k */
   p ~ ddirch(alpha)

   /* The precision of the likelihood for x */
   tau_x ~ dgamma(0.01, 0.01)   

   /* Give the xhat's a shared prior to regularize them */
   for ( j in 1:K ) {
      xhat[j] ~ dnorm(mu_xhat, tau_xhat)
   }
   mu_xhat ~ dnorm(0, 0.01)
   tau_xhat ~ dgamma(0.01, 0.01)
}
The data file must contain the vector \(k\), with NA in the place of the missing values.

Suppose now that the missing \(k_i\)s are not completely unknown. Instead, suppose that we know that some of the \(k_i\) must belong to a group \(g_i\). The group \(g_i\) is encoded as a binary vector, where \(g_{i,j} = 1\) indicates that the trait value \(j\) is in the group, and \(g_{i,j} = 0\) that it is not. In particular, when \(k_i\) is known, then \[ g_{i,j} = \left\{ \begin{array}{ll} 1 & {\rm if\ } j = k_i \\ 0 & {\rm otherwise} \end{array} \right. \] If \(k_i\) is completely unknown, then we just let \(g_{i,j} = 1\) for each \(j\).
data {
   for ( j in 1:K ) {
      alpha[j] <- 1
   }
   /* for the "ones trick", 
    * we need a 1 for every observation 
    */
   for ( i in 1:N ) {
      ones[i] <- 1
   }
}
model {
   for ( i in 1:N ) {
      x[i] ~ dnorm(xhat[k[i]], tau_x)
      /* sample missing k's using the group-vector g */
      k[i] ~ dcat(g[i,])
      /* in order to inform p correctly,
       * we need to multiply the posterior with p[k[i]],
       * which can be done by observing a 1, 
       * assumed to be Bernoulli(p[k[i]]) distributed
       */
      ones[i] ~ dbern(p[k[i]])
   }
   p ~ ddirch(alpha)

   tau_x ~ dgamma(0.01, 0.01)   
   for ( j in 1:K ) {
      xhat[j] ~ dnorm(mu_xhat, tau_xhat)
   }
   mu_xhat ~ dnorm(0, 0.01)
   tau_xhat ~ dgamma(0.01, 0.01)
}
Instead of using the "ones trick", it would have been clearer if we could write
k[i] ~ dcat(g[i,]) /* sampling statement */
k[i] ~ dcat(p) /* statement to inform p */
but in this way JAGS cannot tell that it should sample from \({\rm Categorical}(g_i)\), and not from \({\rm Categorical}(p)\).
Similarly, it might have been tempting to write
k[i] ~ dcat(g[i,] * p) /* point-wise vector multiplication */
This would correctly prefer the common \(k\,\)s in group \(g_i\) over the rare \(k\,\)s, but the fact that \(k_i\) must come from the group \(g_i\) is not used to inform \(p\).

As an example, I generated some random \(k_i\)s sampled from \({\rm Categorical}(p)\) (I did not bother to sample any \(x_i\)s). I have taken \(K = 15\), \(N = 2000\), and \(3\) randomly chosen groups. For \(1000\) of the observations, I "forget" the actual \(k_i\) and only keep the group \(g_i\). The goal is to recover \(p\) and the missing \(k_i\)s. A sketch of how such data could be generated is given below.
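This is my reconstruction, not the original script (the \(x_i\)s are omitted, as in the example):
## sketch: simulate k's, partition the trait values into groups,
## and mask half of the observations
import numpy as np

rng = np.random.default_rng(42)
K, N = 15, 2000

p_true = rng.dirichlet(np.ones(K))        ## true frequencies of the traits
k = rng.choice(K, size=N, p=p_true) + 1   ## true trait values (1-based, as in JAGS)

## partition the K trait values into 3 random groups
assignment = rng.integers(0, 3, size=K)
groups = np.array([(assignment == a).astype(int) for a in range(3)])

## g[i,] indicates the exact k for the first 1000 observations,
## and only the group for the remaining 1000
g = np.zeros((N, K), dtype=int)
g[:1000, :] = np.eye(K, dtype=int)[k[:1000] - 1]
g[1000:, :] = groups[assignment[k[1000:] - 1]]

## the k vector passed to JAGS has NA's for the missing values
k_obs = k.astype(float)
k_obs[1000:] = np.nan

## together with ones = np.ones(N, dtype=int), the data
## (k_obs, g, ones, N, K) would then be passed to JAGS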
Using a chain of length \(20000\) (with the first \(10000\) samples as burn-in and \(1/10\)-th thinning), we get the following result: