<h2>Computing q-values with C++</h2><div dir="ltr" style="text-align: left;" trbidi="on"> When looking for associations between features \(i = 1,\dots, m\) and some trait, it is often necessary to have some sort of multiple-testing correction. A very conservative method is the Bonferroni correction, which controls the family-wise error rate (FWER), but at the cost of many false negatives. This is not desirable when one wants to discover features or associations, and therefore other methods have been developed. One particularly intuitive method is based on the false discovery rate (FDR) and uses so-called \(q\)-values, which are (under certain conditions) elegantly analogous to \(p\)-values. <br/><h2> False discovery rate </h2> Let \(S\) be the number of features called significant, and \(F\) the number of false positives among the significant features (i.e. false discoveries). In <a href="https://projecteuclid.org/download/pdf_1/euclid.aos/1074290335">an article by John D. Storey</a>, the (positive) false discovery rate is defined as \[ {\rm pFDR} := \mathbb{E}\left[\left.\frac{F}{S}\right| S > 0 \right]\,. \] Hence, the pFDR is the expected fraction of false positives among the features that are called significant. The condition \(S > 0\) ensures that it is well defined. <br/><br/> In the case of hypothesis testing, one typically has a test statistic \(T\), and one wants to test if the null hypothesis is true (\(H = 0\)), or rather that the alternative hypothesis is true (\(H = 1\)). The statistical model specifies the distribution of \(T | H = 0\), and the null hypothesis is rejected when the realization of \(T\) falls into a pre-defined significance region \(\Gamma\). 
<br />When testing multiple features, we typically have a sequence \((T_i, H_i)_{i=1}^m\), here assumed to be identically distributed and independent. The \({\rm pFDR}\) then depends on \(\Gamma\): \[ {\rm pFDR}(\Gamma) = \mathbb{E}\left[\left.\frac{F(\Gamma)}{S(\Gamma)}\right| S(\Gamma) > 0 \right] \,, \] where \(F(\Gamma) := \#\{i : T_i \in \Gamma \wedge H_i = 0 \} \) and \(S(\Gamma) := \#\{i : T_i \in \Gamma\}\). Storey derives that under certain conditions, we can write \[ {\rm pFDR}(\Gamma) = \mathbb{P}[H = 0 | T \in \Gamma] = \frac{\mathbb{E}[F(\Gamma)]}{\mathbb{E}[S(\Gamma)]} \] <br/><h2> The q-value </h2> Let \(\{\Gamma_{\alpha}\}_{\alpha=0}^1\) be a nested family of significance regions. That is, \(\Gamma_{\alpha} \subseteq \Gamma_{\alpha'}\) whenever \(\alpha \leq \alpha'\) and \(\mathbb{P}[T \in \Gamma_{\alpha} | H=0] = \alpha\). For instance, if \(T | H = 0 \sim \mathcal{N}(0,1)\), then we could choose \(\Gamma_{\alpha} = [z_{\alpha}, \infty)\), where \(z_{\alpha} = \Phi^{-1}(1-\alpha)\), with \(\Phi\) the CDF of \(\mathcal{N}(0,1)\). <br />The \(q\)-value of a realization \(t\) of \(T\) is then defined as \[ q(t) = \inf_{\Gamma_{\alpha} | t \in \Gamma_{\alpha}} {\rm pFDR}(\Gamma_{\alpha}) \] We can now give the above-mentioned analogy between \(p\)-values and \(q\)-values. The \(p\)-value is defined as: \[ p(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} \mathbb{P}[T \in \Gamma_{\alpha} | H = 0] \] While under the right conditions, the \(q\)-value can be written as: \[ q(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} \mathbb{P}[H = 0 | T \in \Gamma_{\alpha} ] \] <h2>Computing q-values</h2> In order to compute \(q\)-values, given a sequence of \(p\)-values, we follow the steps given in <a href="https://doi.org/10.1073/pnas.1530509100">a paper by Storey and Tibshirani</a>. In this scenario, the \(p\)-value plays the role of the realization \(t\) of the statistic \(T\). Under the null hypothesis, these \(p\)-values are uniformly distributed. 
As a family of significance regions, we simply take \(\Gamma_{\alpha} = [0,\alpha]\), and write for instance \(S(\alpha) := S(\Gamma_{\alpha})\). <br/><br/> First, we have to estimate \(\mathbb{E}[S(\alpha)]\), for which we use \(\#\{i : p_i \leq \alpha\}\), and we estimate \(\mathbb{E}[F(\alpha)]\) with \(m \cdot \hat{\pi}_0 \cdot \alpha\), where \(\hat{\pi}_0\) is an estimate for \(\pi_0 = \mathbb{P}[H=0]\). <br/> The most difficult part of computing \(q\)-values is estimating \(\pi_0\). An often-used estimate is the average \(p\)-value, but we can do a little bit better by making use of the empirical CDF of the \(p\)-values. The figure below shows a "typical" histogram of \(p\)-values, where the \(p\)-values sampled from the alternative distribution are given by the gray bars. The marginal distribution of \(p\)-values becomes relatively flat towards the right, where most \(p\)-values should come from the null distribution, which is \({\rm Uniform}(0,1)\). The right panel of the figure shows a "rotated" empirical CDF of the \(p\)-values, i.e. \[ x \mapsto R(x) = 1-{\rm CDF}(1-x)\,, \] and the fact that large \(p\)-values should be uniformly distributed is reflected in the fact that \(R(x)\) is a straight line for small \(x\). The slope of this straight line is an estimate of \(\pi_0\). In the C++ code below, I use <a href="https://www.gnu.org/software/gsl/doc/html/index.html">GSL</a> to fit a line through \(R\) (indicated by the red line in the figure), using a weight function \(x \mapsto (1-x)^2\) to give more weight to small values of \(x\), thereby mostly ignoring the non-straight part of \(R\). 
<br /> <div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-k6MVAmUzaE8/Wnc4DCMzqLI/AAAAAAAACkw/7RtRWoa8-Ko_IuF_1CCTppQPU1twylafwCLcBGAs/s1600/pvals.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-k6MVAmUzaE8/Wnc4DCMzqLI/AAAAAAAACkw/7RtRWoa8-Ko_IuF_1CCTppQPU1twylafwCLcBGAs/s400/pvals.png" width="400" height="196" data-original-width="1600" data-original-height="785" /></a><br />In this example, the marginal \(p\)-values are sampled from a mixture distribution \(\pi_0 {\rm Uniform}(0,1) + (1-\pi_0) {\rm Beta}(10,1)\), where \(\pi_0 = 3/4\). </div> <br /><br /> Now we sort the \(p\)-values such that \(p_1 \leq p_2 \leq \dots \leq p_m\), so that \(\#\{j : p_j \leq p_i\} = i\), and first determine the \(q\)-value corresponding to \(p_m\): \[ q_m = \hat{\pi}_0 \cdot p_m \] The \(q\)-value \(q_i\) corresponding to \(p\)-value \(p_i\) is computed as \[ q_i = \min(\hat{\pi}_0 p_i m/i, q_{i+1}) \] Recently, I had to implement this algorithm in C++. The function is given in the following header file, and accepts a "keyed" list of \(p\)-values, where the key is used to identify the feature. The function returns a keyed list of \(q\)-values. 
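Before turning to the C++ version, the whole recipe (estimate \(\hat{\pi}_0\) from the rotated empirical CDF, then take running minima over the sorted \(p\)-values) can be sketched in a few lines of Python. This is a rough illustration only: the function names are mine, and a weighted least-squares fit through the origin on a grid stands in for the GSL line fit described above.

```python
import numpy as np

def estimate_pi0(pvals, ngrid=101):
    """Estimate pi0 = P[H=0] by fitting a line through the origin to the
    rotated empirical CDF R(x) = 1 - CDF(1-x), with weights (1-x)^2."""
    p = np.asarray(pvals)
    xs = np.linspace(0.0, 1.0, ngrid)
    R = np.array([np.mean(p > 1.0 - x) for x in xs])
    w = (1.0 - xs) ** 2
    slope = np.sum(w * xs * R) / np.sum(w * xs * xs)
    return min(slope, 1.0)

def compute_qvalues(pvals):
    """Turn p-values into q-values, mirroring the steps above."""
    p = np.asarray(pvals)
    m = len(p)
    pi0 = estimate_pi0(p)
    order = np.argsort(p)
    ps = p[order]
    q = np.empty(m)
    q[-1] = min(pi0 * ps[-1], 1.0)            # q_m = pi0-hat * p_m
    for i in range(m - 2, -1, -1):            # q_i = min(pi0*p_i*m/i, q_{i+1})
        q[i] = min(pi0 * ps[i] * m / (i + 1), q[i + 1])
    qvals = np.empty(m)
    qvals[order] = q                          # restore the original order
    return qvals
```

The weights \((1-x)^2\) downweight the curved part of \(R\), exactly as in the weighted GSL fit above; by construction the resulting \(q\)-values are non-decreasing in the corresponding \(p\)-values.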
<br /><br /> <script src="https://gist.github.com/chvandorp/8ec2489d275a2b3f555c1e3911f9d8eb.js"></script> <br /><br /> The following code gives an example of how to use qvalues.hpp, and can be used for testing: <br /><br /> <script src="https://gist.github.com/chvandorp/06cc3cab9d8547b44ac71386c6e74a11.js"></script> <br /> After compiling and running the program, the result should be something like: <pre><br />$ g++ -O3 -std=c++11 qvalues_test.cpp -lgsl -lgslcblas -o qvaltest<br />$ ./qvaltest 1728<br /> true false<br /> discoveries: 143 32<br /> negatives: 712 113<br />realized FDR: 0.182857<br /></pre> </div><h2>How to calculate WBIC with Stan</h2><div dir="ltr" style="text-align: left;" trbidi="on"> In a <a href="http://tbz533.blogspot.nl/2016/05/computing-bayes-factors-with-stan-and.html">previous post</a>, I showed how to use <a href="http://mc-stan.org/">Stan</a> to calculate the Bayes marginal likelihood with the path sampling method. For large models, this method can be computationally expensive, and therefore estimates for the marginal likelihood have been developed. The most famous one is the Bayesian Information Criterion (BIC), an estimate for the Bayes free energy. Although the name suggests otherwise, the BIC is not a "true Bayesian" estimate, since it depends only on a point estimate, and not the entire posterior distribution: \[ {\sf BIC} = -2 L(\hat{\theta}) + k \cdot \log(n)\,, \] where \(L\) is the log-likelihood function, \(\hat{\theta}\) is the maximum likelihood estimate of the model's parameters, \(k\) is the dimension of \(\theta\) and \(n\) is the number of data points. <br /> More importantly, the BIC does not work for singular models. 
As <a href="http://www.jmlr.org/papers/volume14/watanabe13a/watanabe13a.pdf">Watanabe points out</a>, if a statistical model contains hierarchical layers or hidden variables then it is singular. Watanabe, therefore, proposes a generalization of the BIC, the WBIC, that works for singular models as well, and is truly Bayesian. This is not unlike AIC versus WAIC. <br /><br /> Suppose that you have some sort of MCMC algorithm to sample from the posterior distribution of your model, given your data. When you want to compute the WAIC, you will just have to store the likelihoods of each observation, given each sampled set of parameters. This can be done during an actual run of the MCMC algorithm. For the WBIC, however, you will need to do a separate run at a different "temperature". In fact, <a href="http://www.jmlr.org/papers/volume14/watanabe13a/watanabe13a.pdf">Watanabe gives the following formula</a>: \[ {\sf WBIC} = -2\mathbb{E}_{\theta}^{\beta}[L(\theta)]\,, \] where \(\beta = 1/\log(n)\), and the notation \(\mathbb{E}_{\theta}^{\beta}[f(\theta)]\) expresses that the expectation of \(f\) is taken with respect to the distribution with PDF \[ \theta \mapsto \frac{\exp(\beta L(\theta)) \pi(\theta)}{\int\exp(\beta L(\theta)) \pi(\theta)d\theta} \] which equals the posterior distribution when \(\beta = 1\) and the prior distribution \(\pi\) when \(\beta = 0\). <br /><br /> Something very similar happens in the <a href="http://tbz533.blogspot.nl/2016/05/computing-bayes-factors-with-stan-and.html">path sampling example</a>, with the exception that the "path" is replaced by a single point. In a recent paper about another alternative to the BIC, the Singular BIC (sBIC), <a href="http://www.rss.org.uk/Images/PDF/publications/Drton-Plummer-Oct-16.pdf">Drton and Plummer</a> point out the similarity to the <a href="https://en.wikipedia.org/wiki/Mean_value_theorem">mean value theorem</a> for integrals. 
Using the path sampling method, the log-marginal-likelihood is computed from the following integral: \[ \log(P(D|M)) = \int_0^1 \mathbb{E}_{\theta}^{\beta}[L(\theta)]\, d\beta\,. \] Here, we write \(P(D|M)\) for the marginal likelihood of model \(M\), and data \(D\). According to the mean value theorem \[ \exists \beta^{\ast} \in (0,1): \log(P(D|M)) = \mathbb{E}_{\theta}^{\beta^{\ast}}[L(\theta)]\,, \] and Watanabe argues that choosing \(1/\log(n)\) for \(\beta^{\ast}\) gives a good approximation. Notice that the mean value theorem does not provide a recipe to compute \(\beta^{\ast}\), and that Watanabe uses rather advanced algebraic geometry to prove that his choice for \(\beta^{\ast}\) is good; the mean value theorem only gives a minimal justification for Watanabe's result. <br /><br /> <h2>Implementing WBIC in Stan</h2> Let us start with a very simple example: the biased coin model. Suppose that we have \(N\) coin flips, with outcomes \(x \in \{0,1\}^N\) (heads = 1 and tails = 0), and let \(K\) denote the number of heads. We compare the null model (the coin is fair) with the alternative model (the coin is biased) using \(\Delta {\sf WBIC}\) and see if we get the "right" answer. Notice that for this simple example, we can compute the Bayes factor exactly. <br /><br /> In order to have a Stan model that can be used for both regular sampling and estimating WBIC, a sampling <code>mode</code> is passed with the data to determine the desired behavior. In the <code>transformed data</code> block, a parameter <code>watanabe_beta</code> is then defined, determining the sampling "temperature". <br /> The actual model is defined in the <code>transformed parameters</code> block, such that <code>log_likes</code> can be used in both the <code>model</code> block and the <code>generated quantities</code> block. 
Instead of a sampling statement (<code>x[n] ~ bernoulli(p)</code>), we have to use the <code>bernoulli_lpmf</code> function, which allows us to scale the log-likelihood with <code>watanabe_beta</code> (i.e. <code>target += watanabe_beta * sum(log_likes)</code>). <pre class="brush: cpp"><br />// coin_model.stan<br /><br />data {<br /> int<lower=3> N; // number of coin tosses<br /> int<lower=0, upper=1> x[N]; // outcomes (heads = 1, tails = 0)<br /> int<lower=0, upper=1> mode; // 0 = normal sampling, 1 = WBIC sampling<br />}<br />transformed data {<br /> real<lower=0, upper=1> watanabe_beta; // WBIC parameter<br /> if ( mode == 0 ) {<br /> watanabe_beta = 1.0;<br /> }<br /> else { // mode == 1<br /> watanabe_beta = 1.0/log(N);<br /> }<br />}<br />parameters {<br /> real<lower=0, upper=1> p; // probability of heads<br />}<br />transformed parameters {<br /> vector[N] log_likes;<br /> for ( n in 1:N ) {<br /> log_likes[n] = bernoulli_lpmf(x[n] | p);<br /> }<br />}<br />model {<br /> p ~ beta(1, 1);<br /> target += watanabe_beta * sum(log_likes);<br />}<br />generated quantities {<br /> real log_like;<br /> log_like = sum(log_likes);<br />}<br /></pre> Using the <code>pystan</code> module for Python (3), one could estimate \(p\) from data \(x\) as follows: <pre class="brush:python"><br />import pystan<br />import numpy as np<br />import scipy.stats as sts<br /><br />## compile the model<br />sm = pystan.StanModel(file="coin_model.stan")<br /><br />## create random data<br />N = 50<br />p_real = 0.75<br />x = sts.bernoulli.rvs(p_real, size=N)<br /><br />## prepare data for Stan<br />data = {'N' : N, 'x' : x, 'mode' : 0}<br />pars = ['p', 'log_like']<br /><br />## run the Stan model<br />fit = sm.sampling(data=data)<br />chain = fit.extract(permuted=True, pars=pars)<br />ps = chain['p']<br />print("p =", np.mean(ps), "+/-", np.std(ps))<br /></pre> This should give output similar to <pre><br />p = 0.67187824104 +/- 0.0650602081466<br /></pre> In order to calculate the 
WBIC, the <code>mode</code> has to be set to 1: <pre class="brush:python"><br />## import modules, compile the Stan model, and create data as before...<br /><br />## prepare data for Stan<br />data = {'N' : N, 'x' : x, 'mode' : 1} ## notice the mode<br />pars = ['p', 'log_like']<br /><br />## run the Stan model<br />fit = sm.sampling(data=data)<br />chain = fit.extract(permuted=True, pars=pars)<br />WBIC = -2*np.mean(chain["log_like"])<br /><br />print("WBIC =", WBIC)<br /></pre> which should result in something similar to <pre><br />WBIC = 66.036854275<br /></pre> In order to test if this result is any good, we first compute the WBIC of the null model. This is easy, since the null model does not have any parameters: \[ {\sf WBIC} = -2\cdot L(x) = -2 \cdot \log(\tfrac12^{K}(1-\tfrac12)^{N-K}) = 2N\cdot\log(2) \] and hence, when \(N=50\), we get \({\sf WBIC} \approx 69.3\). Hence, we find positive evidence against the null model, since \(\Delta{\sf WBIC} \approx 3.3\). <br /><br /> For the coin-toss model, we know the marginal likelihood explicitly, namely \[ P(x|M) = \int_0^1 p^K (1-p)^{N-K} dp = B(K+1, N-K+1)\,, \] where \(M\) denotes the alternative (i.e. biased) model, and \(B\) denotes the Beta function. In the above example, we had \(N = 50\) and \(K = 34\), and hence \(-2\cdot\log(P(x|M)) \approx 66.31 \), which is remarkably close to the \({\sf WBIC}\) of the alternative model. 
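Because the \({\rm Beta}(1,1)\) prior is conjugate here, the tempered "posterior" at inverse temperature \(\beta = 1/\log(N)\) is again a Beta distribution, so the WBIC expectation can be evaluated in closed form and used to check the sampled value. The following is a sketch of my own (scipy assumed); \(N = 50\) and \(K = 34\) are taken from the example above.

```python
import numpy as np
from scipy.special import digamma, betaln

N, K = 50, 34              # coin tosses and number of heads, as above
beta_w = 1.0 / np.log(N)   # Watanabe's inverse temperature

# exp(beta*L(p)) * pi(p) with pi = Beta(1,1) is proportional to
# p^(beta*K) * (1-p)^(beta*(N-K)), i.e. a Beta(beta*K+1, beta*(N-K)+1) density
a = beta_w * K + 1.0
b = beta_w * (N - K) + 1.0

# under Beta(a,b), E[log p] and E[log(1-p)] are differences of digammas
E_log_p = digamma(a) - digamma(a + b)
E_log_q = digamma(b) - digamma(a + b)

wbic = -2.0 * (K * E_log_p + (N - K) * E_log_q)        # close to the sampled 66.0
wbic_null = 2.0 * N * np.log(2.0)                      # approx 69.3
minus2_log_marglike = -2.0 * betaln(K + 1, N - K + 1)  # approx 66.3

print(wbic, wbic_null, minus2_log_marglike)
```

The closed-form value agrees with the MCMC estimate up to sampling noise, which is a convenient way to validate the tempered Stan run.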
Similar results hold for several values of \(p\), as demonstrated by the following figure: <div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-9I2_0Iw0FGQ/WPZRko2q3wI/AAAAAAAACe0/ZAmmKfPYkWsGGr8kJA55EY6Hb2maEvf4ACLcB/s1600/wbic_performance.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-9I2_0Iw0FGQ/WPZRko2q3wI/AAAAAAAACe0/ZAmmKfPYkWsGGr8kJA55EY6Hb2maEvf4ACLcB/s400/wbic_performance.png" width="400" height="208" /></a></div> <h2>A non-trivial example</h2> <i> As the example above is non-singular, and WBIC is supposed to work also for singular models, I plan to present an example with a singular model here. </i> <br /></div><h2>Using the "ones trick" to handle unbalanced missing data with JAGS</h2><div dir="ltr" style="text-align: left;" trbidi="on"> The so-called "ones trick" for JAGS and BUGS models allows the user to sample from distributions that are not in the standard list. Here I show another application for "unbalanced" missing data. More examples using the ones (or zeros) trick can be found in <a href="http://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-the-bugs-book/">The BUGS Book</a>, and in particular in <a href="http://www.mrc-bsu.cam.ac.uk/wp-content/uploads/bugsbook_chapter9.pdf">Chapter 9</a>. <br /><br /> Suppose that we want to find an association between a trait \(x\) (possibly continuous) and another categorical trait \(k \in \{1, 2,\dots, K\}\). We have \(N\) observations \(x_i\); however, some of the \(k_i\) are missing. 
Whenever there is no information on the missing \(k_i\) whatsoever, we can simply write the following JAGS model: <pre class="brush: cpp"><br />data {<br /> /* alpha is the parameter for the Dirichlet prior of p, <br /> * with p the estimated frequency of k<br /> */<br /> for ( j in 1:K ) {<br /> alpha[j] <- 1<br /> }<br />}<br />model {<br /> for ( i in 1:N ) {<br /> /* Some model for x, given k */<br /> x[i] ~ dnorm(xhat[k[i]], tau_x)<br /> /* Sample missing values of k, and inform<br /> * the distribution p with the known k's<br /> */<br /> k[i] ~ dcat(p)<br /> }<br /> /* Model the distribution p_k of the trait k */<br /> p ~ ddirch(alpha)<br /><br /> /* The precision of the likelihood for x */<br /> tau_x ~ dgamma(0.01, 0.01) <br /><br /> /* Give the xhat's a shared prior to regularize them */<br /> for ( j in 1:K ) {<br /> xhat[j] ~ dnorm(mu_xhat, tau_xhat)<br /> }<br /> mu_xhat ~ dnorm(0, 0.01)<br /> tau_xhat ~ dgamma(0.01, 0.01)<br />}<br /></pre> The data file must contain the vector \(k\), with NA in the place of the missing values. <br /><br /> Suppose now that the missing \(k_i\)s are not completely unknown. Instead, suppose that we know that some of the \(k_i\) must belong to a group \(g_i\). The group \(g_i\) is encoded as a binary vector, where \(g_{i,j} = 1\) indicates that the trait value \(j\) is in the group, and \(0\) that it is not. In particular, when \(k_i\) is known, then \[ g_{i,j} = \left\{ \begin{array}{ll} 1 & {\rm if\ } j = k_i \\ 0 & {\rm otherwise} \end{array} \right. \] If \(k_i\) is completely unknown, then we just let \(g_{i,j} = 1\) for each \(j\). 
<pre class="brush: cpp"><br />data {<br /> for ( j in 1:K ) {<br /> alpha[j] <- 1<br /> }<br /> /* for the "ones trick", <br /> * we need a 1 for every observation <br /> */<br /> for ( i in 1:N ) {<br /> ones[i] <- 1<br /> }<br />}<br />model {<br /> for ( i in 1:N ) {<br /> x[i] ~ dnorm(xhat[k[i]], tau_x)<br /> /* sample missing k's using the group-vector g */<br /> k[i] ~ dcat(g[i,])<br /> /* in order to inform p correctly,<br /> * we need to multiply the posterior with p[k[i]],<br /> * which can be done by observing a 1, <br /> * assumed to be Bernoulli(p[k[i]]) distributed<br /> */<br /> ones[i] ~ dbern(p[k[i]])<br /> }<br /> p ~ ddirch(alpha)<br /><br /> tau_x ~ dgamma(0.01, 0.01) <br /> for ( j in 1:K ) {<br /> xhat[j] ~ dnorm(mu_xhat, tau_xhat)<br /> }<br /> mu_xhat ~ dnorm(0, 0.01)<br /> tau_xhat ~ dgamma(0.01, 0.01)<br />}<br /></pre> Instead of using the "ones trick", it would have been clearer if we could write <pre class="brush: cpp"><br />k[i] ~ dcat(g[i,]) /* sampling statement */<br />k[i] ~ dcat(p) /* statement to inform p */<br /></pre> but then JAGS cannot tell that it should sample from \({\rm Categorical}(g_i)\), and not from \({\rm Categorical}(p)\). <br /> Similarly, it might have been tempting to write <pre class="brush: cpp"><br />k[i] ~ dcat(g[i,] * p) /* point-wise vector multiplication */<br /></pre> This would correctly prefer the common \(k\,\)s in group \(g_i\) over the rare \(k\,\)s, but the fact that \(k_i\) must come from the group \(g_i\) is not used to inform \(p\). <br /><br /> As an example, I generated some random \(k_i\)s sampled from \({\rm Categorical}(p)\) (I did not bother to sample any \(x_i\)s). I have taken \(K = 15\), \(N = 2000\), and \(3\) randomly chosen groups. For \(1000\) of the observations, I "forgot" the actual \(k_i\), and only know the group \(g_i\). The goal is to recover \(p\) and the missing \(k_i\)s. 
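The simulated data set just described can be generated with a few lines of Python. This is my own reconstruction of the setup (the group assignment, seed, and variable names are arbitrary choices); the arrays <code>k_obs</code> and <code>g</code> are what would be handed to JAGS, with \(-1\) standing in for NA.

```python
import numpy as np

rng = np.random.default_rng(2016)
K, N, n_groups, n_missing = 15, 2000, 3, 1000

p = rng.dirichlet(np.ones(K))                 # true trait frequencies
group_of = rng.integers(0, n_groups, size=K)  # partition trait values into 3 groups
k = rng.choice(K, size=N, p=p)                # true (partly unobserved) traits

# for each observation, build the binary group vector g[i,]
g = np.zeros((N, K), dtype=int)
known = np.ones(N, dtype=bool)
known[rng.choice(N, size=n_missing, replace=False)] = False
for i in range(N):
    if known[i]:
        g[i, k[i]] = 1   # trait observed exactly: indicator of k[i]
    else:
        # only the group is known: indicator of all traits in k[i]'s group
        g[i] = (group_of == group_of[k[i]]).astype(int)

# JAGS would receive k with NA for unobserved entries, together with g
k_obs = np.where(known, k, -1)                # -1 stands in for NA here
```

Note that every row of <code>g</code> contains the true \(k_i\), so the categorical sampling statement <code>k[i] ~ dcat(g[i,])</code> always has positive mass on the truth.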
<br /> Using a chain of length \(20000\) (using the first \(10000\) as a burn-in and \(1/10\)-th thinning) we get the following result: <br /> <div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-QQTuKM_o3j0/WDRpxT_pzJI/AAAAAAAACZU/whQ8Kv9mnXg9WPCMO-aJ-hBiVPlrcySuACLcB/s1600/jags-ones-trick.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-QQTuKM_o3j0/WDRpxT_pzJI/AAAAAAAACZU/whQ8Kv9mnXg9WPCMO-aJ-hBiVPlrcySuACLcB/s400/jags-ones-trick.png" width="400" height="400" /></a></div> <br /> <br /></div><h2>Computing Bayes factors with Stan and a path-sampling method</h2><div dir="ltr" style="text-align: left;" trbidi="on"> <a href="http://mc-stan.org/">Stan</a> is a great program for MCMC (or <a href="https://en.wikipedia.org/wiki/Hybrid_Monte_Carlo">HMC</a>, really). Vehtari <i>et al.</i> explain <a href="http://arxiv.org/abs/1507.04544">here</a> how to use Stan to compute WAIC. For the Bayes factor, however, I have not found a method yet, and therefore I would like to demonstrate a possible method here. This will obviously not work well for every model; this is merely an experiment. <br/> Recently, I was really intrigued by a paper by <a href="https://projecteuclid.org/euclid.ss/1028905934">Gelman and Meng</a>, where several methods for computing Bayes factors, or normalizing constants, are explained and connected (even the <a href="https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/">really bad ones</a>). Here, I will use the <i>path sampling</i> method. <br/> Let us implement a simple model in Stan, for which we can explicitly compute the marginal likelihood. 
Then we can try to estimate this marginal likelihood with the path-sampling method, and compare it with the exact value. <br/> A very simple model is the 'fair coin' example (taken directly from <a href="https://en.wikipedia.org/wiki/Bayes_factor">wikipedia</a>). The Bayes factor between a null-model \(M_0\) and a model that incorporates a bias, \(M_1\), can be computed directly as the quotient of the marginal likelihoods, and moreover, the null model does not have any parameters. <br/> Let \(n\) denote the number of coin tosses, and \(k\) the number of 'heads'. Hence, the data \(D\) is given by the pair \((n,k)\). Given a prior \(\theta \sim {\rm Beta}(\alpha, \beta)\) on the probability of throwing heads, we get the posterior \(\theta \sim {\rm Beta}(\alpha + k, \beta +n-k)\), and we can compute the marginal likelihood exactly: <br/> \[ p(D|M_1) = \int_0^1 p(D|\theta)\pi(\theta) d\theta \] \[ = {n \choose k}\int_0^1 \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1} d\theta = {n \choose k} B(k+\alpha, n-k+\beta)\,, \] <br/> where \(B\) denotes the Beta function. Meanwhile, <br/> \[ p(D | M_0) = {n \choose k} \left(\tfrac12\right)^k\left(1-\tfrac12\right)^{n-k}\,. \] <br/> In this instance of path sampling (and closely following <a href="https://projecteuclid.org/euclid.ss/1028905934">Gelman and Meng</a>), we consider a family of (un-normalized) distributions \(Q_T\), indexed by a parameter \(T \in [0,1]\), such that \(Q_0(\theta) = \pi(\theta)\) and \(Q_1(\theta) = p(D|\theta)\pi(\theta)\). The normalizing constants are denoted by \(z(T)\). Notice that \(z(0) = 1\) and \(z(1) = p(D|M_1)\). <br/> Let \(\Theta = [0,1]\) denote the support of \(\theta\). 
Since <br/> \[ \frac{d}{dT} \log z(T) = \frac{1}{z(T)} \frac{d}{dT} z(T) = \frac{1}{z(T)} \frac{d}{dT} \int_{\Theta} Q_T(\theta) d\theta\,, \] <br/> we get that <br/> \[ \frac{d}{dT} \log z(T) = \int_{\Theta} \frac{1}{z(T)} \frac{d}{dT} Q_T(\theta) d\theta\,, \] <br/> and hence <br/> \[ \frac{d}{dT} \log z(T) = \int_{\Theta} \frac{Q_T(\theta)}{z(T)} \frac{d}{dT} \log(Q_T(\theta)) d\theta\,. \] <br/> When we denote by \(\mathbb{E}_T\) the expectation under \(P_T\), we get that <br/> \[ \frac{d}{dT} \log z(T) = \int_{\Theta} P_T(\theta) \frac{d}{dT} \log(Q_T(\theta)) d\theta = \mathbb{E}_T\left[ \frac{d}{dT} \log(Q_T(\theta)) \right]\,. \] <br/> We can think of \(U(\theta, T) := \frac{d}{dT} \log(Q_T(\theta))\) as 'potential energy', and we get <br/> \[ \int_0^1 \mathbb{E}_T\left[U(\theta, T)\right] dT = \int_0^1 \frac{d}{dT} \log(z(T)) dT \] \[ = \log(z(1)) - \log(z(0)) = \log(z(1)/z(0)) =: \lambda\,. \] <br/> Notice that in our case \(\lambda = \log(P(D|M_1))\). We can interpret \(\int_0^1 \mathbb{E}_T\left[U(\theta, T)\right] dT\) as the expectation of \(U\) over the joint probability density of \(T\) (with a uniform prior) and \(\theta\): <br/> \[ \lambda = \mathbb{E}\left[ U(\theta, T)\right]\,. \] <br/> This suggests an estimator \(\hat{\lambda}\) for \(\lambda\): <br/> \[ \hat{\lambda} = \frac{1}{N} \sum_{i=1}^N U(\theta_i, T_i)\,, \] <br/> where \((\theta_i, T_i)_i\) is a sample from the joint distribution of \(\theta\) and \(T\). A way of creating such a sample, is first sampling \(T_i\) from its marginal (uniform) distribution, and then sampling \(\theta_i\) from \(P_{T_i}\). This last step <i>might</i> require some Monte Carlo sampling. <br/> First, we need to choose \(1\)-parameter family of distributions. A simple choice is the <i>geometric path</i>: <br/> \[ Q_T(\theta) = \pi(\theta)^{1-T} (\pi(\theta)p(D|\theta))^{T} = \pi(\theta) p(D|\theta)^T\,. 
\] <br/> In this case, the potential energy simply equals \(\frac{d}{dT}\log(Q_T(\theta)) = \log p(D|\theta)\). <br/> <h2>The Stan model</h2>Using the <a href="http://pystan.readthedocs.io/en/latest/">pystan</a> interface, we can implement the model as follows. The most important parts are the parameter \(T\) (declared in the data section), and the "generated quantity" \(U\). <pre class="brush: python"><br />## import some modules<br />import pystan<br />import scipy.stats as sts<br />import scipy.special as spl<br />import numpy as np<br />import multiprocessing<br /><br />## define a Stan model<br />model = """<br />data {<br /> int<lower=0> n;<br /> int<lower=0, upper=n> k;<br /> real<lower=0> alpha;<br /> real<lower=0> beta;<br /> real<lower=0, upper=1> T; // parameter for path sampling<br />}<br />parameters {<br /> real<lower=0, upper=1> theta;<br />}<br />model {<br /> theta ~ beta(alpha, beta);<br /> increment_log_prob(T*binomial_log(k, n, theta));<br /> // replaces sampling statement "k ~ binomial(n, theta)"<br />}<br />generated quantities {<br /> real U;<br /> U <- binomial_log(k, n, theta);<br />}<br />"""<br /><br />## let Stan translate this into C++, and compile...<br />sm = pystan.StanModel(model_code=model)<br /></pre> <h2> A parallel method </h2> We need to generate samples \(T_i\) from \({\rm Uniform}(0,1)\), and then, given \(T_i\), we generate a sample \(\theta_i\) from \(P_{T_i}\). The simplest way is just to make a partition \(T_i = \tfrac{i}{N}\) of \([0,1]\) and then for each \(i=0,\dots,N\), use the Stan model with \(T = T_i\). Notice that for each \(i\), we will generate multiple (\(K\), say) samples from \(P_{T_i}\). This method lends itself well to multiprocessing, as all \(N+1\) Stan sessions can run in parallel. 
<pre class="brush: python"><br />## choose some parameters<br />k = 10 ## heads<br />n = 100 ## coin tosses<br />alp = 1 ## determines prior for q<br />bet = 1 ## determines prior for q<br />K = 100 ## length of each chain<br />N = 1000 ## number of Ts<br /><br />## a function that prepares a data dictionary,<br />## and then runs the Stan model<br />def runStanModel(T):<br /> coin_data = {<br /> 'n' : n, <br /> 'k' : k, <br /> 'alpha' : alp,<br /> 'beta' : bet,<br /> 'T' : T<br /> }<br /> fit = sm.sampling(data=coin_data, iter=2*K, <br /> warmup=K, chains=1) <br /> la = fit.extract(permuted=True)<br /> return la['U'] ## U is a "generated quantity"<br /><br />## make a partition of [0,1] <br />Ts = np.linspace(0, 1, N+1)<br />## start a worker pool<br />pool = multiprocessing.Pool(4) ## 4 threads<br />## for each T in Ts, run the Stan model<br />Us = np.array(pool.map(runStanModel, Ts))<br /></pre> Let's have a look at the result. Notice that for \(\alpha=\beta=1\), the marginal likelihood does not depend on \(k\) as \[ P(D|M_1) = {n \choose k} \frac{\Gamma(k+1)\Gamma(n-k+1)}{\Gamma(n+2)} = {n \choose k} \frac{k! (n-k)!}{(n+1)!} = \frac{1}{n+1} \] We could take for \(\hat{\lambda}\) the average of <i>all</i> \((N+1)\cdot K\) samples, but in my experience, the standard error is more realistic when I only take one sample per \(T_i\). 
<pre class="brush: python"><br />## take one sample for each T<br />lamhat = np.mean(Us[:,-1]) <br />## we can also compute a standard error!!<br />se_lamhat = sts.sem(Us[:,-1])<br /><br />print "estimated lambda = %f +/- %f"%(lamhat, se_lamhat)<br />print "estimated p(D|M_1) = %f"%np.exp(lamhat)<br /><br />exactMargLike = spl.beta(k+alp, n-k+bet) * spl.binom(n,k)<br />exactMargLoglike = np.log(exactMargLike)<br /><br />print "exact lambda = %f"%exactMargLoglike<br />print "exact p(D|M_1) = %f"%exactMargLike<br /></pre>In my case, the result is <pre><br />estimated lambda = -4.724850 +/- 0.340359<br />estimated p(D|M_1) = 0.008872<br />exact lambda = -4.615121<br />exact p(D|M_1) = 0.009901<br /></pre> <h2> A serial method </h2> Another method does not use parallel processing, but uses the fact that the distributions \(P_{T_i}\) and \(P_{T_{i+1}}\) are very similar when \(T_{i+1}-T_{i} = \frac{1}{N}\) is small. When we have a sample from \(P_{T_i}\), we can use it as the initial condition for the Stan run with \(T = T_{i+1}\). We then only need very little burn-in (warm-up) time before we are actually sampling from \(P_{T_{i+1}}\). We can specify the number of independent chains that Stan computes, and also separate initial parameters for each of the chains. Hence, we can take multiple samples from \(P_{T_i}\) as initial choices for the next chain. For this very simple model, this "serial" method is much slower than the parallel method, but my guess is that it could be a lot faster for more complicated models. I hope to prove this claim in a future post. 
<pre class="brush: python"><br />## choose some parameters<br />k = 10 ## heads<br />n = 100 ## coin tosses<br />alp = 1 ## determines prior for q<br />bet = 1 ## determines prior for q<br />K = 100 ## size initial chain<br />N = 200 ## number of Ts<br /><br />## initially, do a longer run with T=0<br />coin_data = {<br /> 'n' : n, <br /> 'k' : k, <br /> 'alpha' : alp,<br /> 'beta' : bet,<br /> 'T' : 0<br />}<br />fit = sm.sampling(data=coin_data, iter=2*K, <br /> warmup=K, chains=1)<br />la = fit.extract(permuted=True)<br /><br />## in stead of length K, <br />## now use a much shorter chain (of length L)<br />L = 10 <br />chains = 4 <br />Us = np.zeros(shape=(N+1,L*chains))<br />Ts = np.linspace(0, 1, N+1)<br /><br />## now run the 'chain of chains'<br />for i, Ti in enumerate(Ts):<br /> coin_data['T'] = Ti ## take another T<br /> ## take some thetas from the previous sample<br /> thetas = np.random.choice(la["theta"], chains)<br /> initial_guesses = [{'theta' : theta} for theta in thetas]<br /> fit = sm.sampling(data=coin_data, iter=2*L, warmup=L, <br /> chains=chains, init=initial_guesses)<br /> la = fit.extract(permuted=True)<br /> Us[i,:] = la['U']<br /></pre> Ok, let us have another look at the result. For this, I used the same code as above: <pre><br />estimated lambda = -5.277354 +/- 1.120274<br />estimated p(D|M_1) = 0.005106<br />exact lambda = -4.615121<br />exact p(D|M_1) = 0.009901<br /></pre> As you can see, the estimate is less precise, but this is due to the fact that \(N=200\) instead of \(1000\). <br/> I wrote and ran the above code fragments in a <a href="http://jupyter.org/">Jupyter notebook</a>. As compiling and sampling can take a lot of time, such an interface can be very convenient. Please let me know if this was useful for you or if you have any questions, and also please tell me if I did something stupid... 
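As a final sanity check on both methods: for this conjugate model, \(P_T\) is simply \({\rm Beta}(Tk+\alpha, T(n-k)+\beta)\), so \(\mathbb{E}_T[U]\) has a closed form in terms of digamma functions and the path-sampling integral can be evaluated deterministically, without any MCMC. This is my own addition (scipy assumed), not part of the original workflow.

```python
import numpy as np
from scipy.special import digamma, gammaln

n, k, alp, bet = 100, 10, 1, 1   # same data and prior as above

# log of the binomial coefficient in binomial_log(k, n, theta)
log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def expected_U(T):
    """E_T[log p(D|theta)], with P_T = Beta(T*k + alp, T*(n-k) + bet)."""
    a = T * k + alp
    b = T * (n - k) + bet
    return (log_binom
            + k * (digamma(a) - digamma(a + b))
            + (n - k) * (digamma(b) - digamma(a + b)))

# integrate E_T[U] over T in [0,1] with the trapezoidal rule
Ts = np.linspace(0.0, 1.0, 10001)
Es = expected_U(Ts)
lam = np.sum(0.5 * (Es[1:] + Es[:-1]) * np.diff(Ts))

print("lambda =", lam)                  # close to -4.6151
print("exact  =", np.log(1.0 / (n + 1.0)))
```

The fine grid matters because \(\mathbb{E}_T[U]\) changes very rapidly near \(T=0\); with it, the quadrature reproduces \(\log(1/(n+1))\) to high accuracy, confirming the path-sampling identity for this model.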
<br /></div>Christiaan van Dorphttps://plus.google.com/104659278771258804795noreply@blogger.com0tag:blogger.com,1999:blog-8385427779430646568.post-5113387799221052722016-05-02T11:29:00.000-07:002016-05-02T12:33:17.374-07:00A Sunset Colormap for Python<div dir="ltr" style="text-align: left;" trbidi="on">Room Z533 in the Kruyt building in Utrecht can have a wonderful view, especially at sunset. Not too long ago, I took the following photograph.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-ozJCMIIpwU0/VyeM_pqEmHI/AAAAAAAACRI/iVBiupe3O_IXoEao_DhjXJ0TkgYVnLXXwCLcB/s1600/view-uithof.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://2.bp.blogspot.com/-ozJCMIIpwU0/VyeM_pqEmHI/AAAAAAAACRI/iVBiupe3O_IXoEao_DhjXJ0TkgYVnLXXwCLcB/s400/view-uithof.jpg" width="400" /></a></div>Having a love-hate relationship with <a href="http://matplotlib.org/examples/color/colormaps_reference.html" target="_blank">colormaps</a>, and in particular <a href="http://matplotlib.org/users/colormaps.html" target="_blank">choosing</a> a good one, I instantly noticed the beauty of the gradient of the sky, and hence its potential as a colormap.<br /><br />In Python, it is easy to import an image as a matrix, and then extract the RGB values along a line. 
To make things easier, I first cropped out the part of the sky I wanted:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-Z2J-AigRDXE/VyeQHtwy72I/AAAAAAAACRU/QoSzD4wBKKQt9z56DjA-EYr-uJgLkHWHgCLcB/s1600/uithof-sunset2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://2.bp.blogspot.com/-Z2J-AigRDXE/VyeQHtwy72I/AAAAAAAACRU/QoSzD4wBKKQt9z56DjA-EYr-uJgLkHWHgCLcB/s200/uithof-sunset2.jpg" width="12" /></a></div>Then, I had a look at <a href="http://matplotlib.org/examples/pylab_examples/custom_cmap.html" target="_blank">this</a> website.<br />Let's import the "slice of sky" in Python and have a look:<br /> <pre class="brush: python"><br />import matplotlib.pyplot as plt<br />import numpy as np<br />from matplotlib.colors import LinearSegmentedColormap<br />from PIL import Image<br /><br />im = Image.open("/path/to/sunset.jpg")<br />pix = im.load()<br /><br />xs = range(im.size[1])<br />y = im.size[0]/2 ## fix the horizontal position: we sample a vertical line of pixels<br />rs = [pix[y,x][0] for x in xs]<br />gs = [pix[y,x][1] for x in xs]<br />bs = [pix[y,x][2] for x in xs]<br /><br />fig = plt.figure(figsize=(5,2))<br />ax = fig.add_subplot(111)<br /><br />ax.plot(xs, rs, color='red')<br />ax.plot(xs, gs, color='green')<br />ax.plot(xs, bs, color='blue')<br />ax.set_xlabel("pixel")<br />ax.set_ylabel('r/g/b value')<br /><br />plt.savefig('rgb-plot.png', dpi=300, bbox_inches='tight')<br /></pre><br />This results in the following figure: <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-mK8YnkLFYHg/VyeV3ju9mbI/AAAAAAAACRk/U6Y5jJwcHm073khQZAGUIqTE_4xh_NXegCKgB/s1600/rgb-plot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-mK8YnkLFYHg/VyeV3ju9mbI/AAAAAAAACRk/U6Y5jJwcHm073khQZAGUIqTE_4xh_NXegCKgB/s320/rgb-plot.png" /></a></div><br /></div> Now let's make the actual colormap. 
<br /> <pre class="brush: python"><br />Dp = 1 ## determines the number of pixels between "nodes"<br />xs = np.linspace(0,1,im.size[1]/Dp + 1) ## points between 0 and 1 <br />idxs = range(0,im.size[1],Dp) ## indices in the original picture matrix<br /><br />redlist = [rs[idx]/255. for idx in idxs] + [rs[-1]/255., 0]<br />greenlist = [gs[idx]/255. for idx in idxs] + [gs[-1]/255., 0]<br />bluelist = [bs[idx]/255. for idx in idxs] + [bs[-1]/255., 0]<br /><br />## LinearSegmentedColormap wants these weird triples, <br />## where some end points are ignored...<br /><br />redtuples = [(x, redlist[i], redlist[i+1]) for i, x in enumerate(xs)]<br />greentuples = [(x, greenlist[i], greenlist[i+1]) for i, x in enumerate(xs)]<br />bluetuples = [(x, bluelist[i], bluelist[i+1]) for i, x in enumerate(xs)]<br /><br />cdict = {'red' : redtuples, 'green' : greentuples, 'blue' : bluetuples}<br /><br />cmap = LinearSegmentedColormap('tbz533', cdict) ## choose a name<br />plt.register_cmap(cmap=cmap)<br />## now we can use our new colormap!<br /></pre> <br />OK. 
Let's make some 70s wallpaper to celebrate the new colormap: <br /> <pre class="brush: python"><br />xs = np.linspace(-5,5,1000)<br />ys = np.linspace(-5,5,1000)<br />zs = np.array([[np.sin(x)*np.cos(y) for x in xs] for y in ys])<br /><br />fig = plt.figure(figsize=(5.5,5))<br />ax1 = fig.add_subplot(1,1,1)<br />C = ax1.pcolormesh(xs, ys, zs, cmap='tbz533')<br />fig.colorbar(C)<br /><br />fig.savefig('my70swallpaper.png', dpi=200, bbox_inches='tight')<br /></pre> <br />This results in the following figure: <br /> <div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Oo4EwHk14UQ/VyecOKo7wEI/AAAAAAAACSA/4QleyYXUo7MwlrMRXXxgG2s12owx_DMqQCLcB/s1600/KruytZ533_example.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-Oo4EwHk14UQ/VyecOKo7wEI/AAAAAAAACSA/4QleyYXUo7MwlrMRXXxgG2s12owx_DMqQCLcB/s320/KruytZ533_example.png" /></a></div> <br />That's it! Please leave your own favorite colormap in the comments.Christiaan van Dorphttps://plus.google.com/104659278771258804795noreply@blogger.com0tag:blogger.com,1999:blog-8385427779430646568.post-330392583210748222015-11-06T15:13:00.001-08:002016-05-02T11:58:01.857-07:00ETA for C++<div dir="ltr" style="text-align: left;" trbidi="on"> Simulations can take some time, and I'd like to know how long. This is easy, right? Yes, it is. I've done it lots of times, but every time I do, I curse myself for not using an old piece of code. <br />Most likely, there is some standard, best way of doing this, but I haven't found it. Most recently, I did this: I made a simple object "EtaEstimator" that can be updated every (costly) time step and asked for an estimated time of "arrival" at any time. Here's the header: <pre class="brush: cpp"><br />// eta.hpp<br />#include <ctime><br />#include <cmath> // floor<br />#include <iostream><br /><br />class EtaEstimator {<br />public:<br /> EtaEstimator(int ); <br /> // construction starts the clock. 
Pass the number of steps<br /> void update();<br /> void print(std::ostream & ) const;<br />private:<br /> double ct, etl; // cumulative time, estimated time left<br /> int n, N; // steps taken, total number of steps<br /> clock_t tick; // time after update ((c) matlab)<br /> // statics...<br /> static const int secperday = 86400;<br /> static const int secperhour = 3600;<br /> static const int secperminute = 60;<br />};<br /><br />std::ostream & operator<<(std::ostream & , const EtaEstimator & );<br /></pre> The members are straightforward too: <pre class="brush: cpp"><br />// eta.cpp<br />#include "eta.hpp"<br /><br />EtaEstimator::EtaEstimator(int N) :<br /> ct(0.0), etl(0.0), n(0), N(N) {<br /> tick = clock();<br />}<br /><br />void EtaEstimator::update() {<br /> clock_t dt = clock() - tick;<br /> tick += dt;<br /> ct += (double(dt)/CLOCKS_PER_SEC); // prevent integer division<br /> // CLOCKS_PER_SEC is defined in ctime<br /> ++n;<br /> etl = (ct/n) * (N-n);<br />}<br /><br />void EtaEstimator::print(std::ostream & os) const {<br /> double etlprime = etl;<br /> int days = floor(etlprime / secperday);<br /> etlprime -= days * secperday;<br /> int hours = floor(etlprime / secperhour); <br /> etlprime -= hours * secperhour;<br /> int minutes = floor(etlprime / secperminute);<br /> etlprime -= minutes * secperminute;<br /> int seconds = floor(etlprime);<br /> os << (days > 0 ? std::to_string(days) + " " : "")<br /> << hours << ":" <br /> << (minutes < 10 ? "0" : "") << minutes << ":" <br /> << (seconds < 10 ? 
"0" : "") << seconds;<br />}<br /><br />std::ostream & operator<<(std::ostream & os,<br /> const EtaEstimator & eta) {<br /> eta.print(os);<br /> return os;<br />}<br /></pre> Typical usage of EtaEstimator would be the following: <pre class="brush: cpp"><br />#include <iostream><br />#include "eta.hpp"<br /><br />// about to do lots of work...<br />int N = 1000;<br />EtaEstimator eta(N);<br />for ( int n = 0; n < N; ++n ) {<br /> // do something very expensive<br /> eta.update()<br /> std::cout << "\rETA: " << eta << std::flush;<br />}<br />// ...<br /></pre> PS: std::to_string is a C++11 feature, and can be ignored by using something like <pre class="brush: cpp"><br />if ( days > 0 ) os << days << " "; // else nothing at all<br /></pre> <br /></div>Christiaan van Dorphttps://plus.google.com/104659278771258804795noreply@blogger.com0tag:blogger.com,1999:blog-8385427779430646568.post-2793043206232117222015-10-26T03:14:00.000-07:002016-05-02T12:43:32.836-07:00Carnes's life span distribution<div dir="ltr" style="text-align: left;" trbidi="on"> In a paper by <a href="http://www.ncbi.nlm.nih.gov/pubmed/16732401">Carnes <i>et. al</i></a>, a simple parametric, but realistic life span distribution is given, and here I show how you can sample from it. In addition, assuming a demographic equilibrium, the age of individuals will have a particular distribution. I will show what this distribution is, and again how to sample from it. Sampling ages instead of lifespans might be useful for initiating simulations. I model epidemics, and I want my disease free (a.k.a. virgin) population to have the 'right' age distribution. 
<br/> The life span distribution has hazard \[\lambda(a) = e^{u_1 a + v_1} + e^{u_2 a + v_2}\,.\] Typical parameters are given by \(u_1 = 0.1\), \(v_1 = -10.5\), \(u_2 = -0.4\), and \(v_2 = -8\), so that infants have a slightly increased hazard of dying, and after the age of 60, the hazard rapidly starts to grow, until it becomes exceedingly large around \(a = 100\). <br/> The survival function \(S(a) = {\rm Pr}(A > a)\), where \(A\) is the random variable denoting 'age at death', is given by \(S(a) = e^{-\Lambda(a)}\), with \(\Lambda(a) := \int_0^a \lambda(a')da'\) denoting the cumulative hazard. The cumulative hazard \(\Lambda\) is easily calculated: \[ \Lambda(a) = \frac{e^{v_1}}{u_1}(e^{u_1 a}-1) + \frac{e^{v_2}}{u_2}(e^{u_2 a}-1)\,, \] but its inverse, and hence the inverse of the survival function, is more difficult to calculate. <br/> We need the inverse of \(S\), because sampling random deviates typically involves uniformly sampling a number \(u\in [0,1)\). The number \(S^{-1}(u)\) is then the desired deviate. <br/><br/> <div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-aA9V6tmFWLo/Vi5qLXOznLI/AAAAAAAACLk/fxSm-2omRRs/s1600/carnes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-aA9V6tmFWLo/Vi5qLXOznLI/AAAAAAAACLk/fxSm-2omRRs/s320/carnes.png" /></a></div> In a future post, I will show how to use the <a href="http://www.gnu.org/software/gsl/">GNU Scientific Library (GSL)</a> to sample deviates from \(A\). <br/><br/> Suppose that the birth rate \(b\) in our population is constant. A PDE describing the population is given by \[\frac{\partial N(a,t)}{\partial t} + \frac{\partial N(a,t)}{\partial a} = -\lambda(a)N(a,t)\,,\] where \(N(a,t)\) is the number (density) of individuals of age \(a\), alive at time \(t\). 
The boundary condition (describing birth) is given by \[N(0,t) = b\,.\] When we assume that the population is in a demographic equilibrium, the time derivative with respect to \(t\) vanishes, and we get an ODE for the age distribution: \[\frac{\partial N(a)}{\partial a} = -\lambda(a) N(a)\,,\quad N(0) = b\,,\] where we omitted the variable \(t\). This equation can be solved: \[\frac{1}{N}\frac{\partial N}{\partial a} = \frac{\partial \log(N)}{\partial a} = -\lambda(a) \implies N(a) = c \cdot e^{-\Lambda(a)}\] for some constant \(c\). Since \(b = N(0) = c \cdot e^{-\Lambda(0)} = c\), we have \[N(a) = b\cdot e^{-\Lambda(a)}\,.\] Hence, we now know the PDF of the age distribution (up to a constant). Unfortunately, we can't get a closed-form formula for the CDF, let alone invert it. Therefore, when we want to sample, we need another trick. I've used a method from <a href="http://numerical.recipes/">Numerical Recipes in C</a>. It involves finding a dominating function of the PDF, with an easy, and easily invertible, primitive. <br/>Let's just assume that \(b = 1\), so that the objective PDF is \(N(a) = e^{-\Lambda(a)}\). Please notice that \(N\) is not a proper PDF, since, in general, it won't integrate to \(1\). We need to find a simple, dominating function for \(N\). A piecewise-defined function might be a good choice, since the hazard is practically zero when the age is below \(50\), and then increases rapidly. We first find a comparison cumulative hazard \(\tilde{\Lambda}\) that <i>is dominated by</i> the actual cumulative hazard \(\Lambda\). Many choices are possible, but one can take for instance \[ \tilde{\Lambda}(a) = \left\{ \begin{array}{ll} 0 & \mbox{if } a < a_0 \\ \lambda(a^{\ast})\cdot (a-a_0) & \mbox{otherwise} \end{array}\right. \] where \[ a_0 = a^{\ast} - \frac{\Lambda(a^{\ast})}{\lambda(a^{\ast})}\,. \] The constant \(a_0\) is chosen such that the cumulative hazards \(\Lambda\) and \(\tilde{\Lambda}\) are tangent at \(a^{\ast}\). 
<br/> Since \(\Lambda\) dominates \(\tilde{\Lambda}\), the survival function \(\tilde{S}\) defined by \(\tilde{S}(a) = e^{-\tilde{\Lambda}(a)}\) dominates \(S\). It is easy to find the inverse of \(a\mapsto\int_0^a \tilde{S}(a')da'\), and hence we can easily sample random deviates from the age distribution corresponding to \(\tilde{\Lambda}\). In order to sample from the desired age distribution, one can use a <i>rejection method</i>: (i) sample an age \(a\) from the easy age distribution. (ii) compute the ratio \(r = S(a)/\tilde{S}(a) \leq 1\). (iii) sample a deviate \(u \sim \mbox{Uniform}(0,1)\). (iv) <i>accept</i> the age \(a\) when \(u \leq r\), and <i>reject</i> \(a\) otherwise. Repeat these steps until an age \(a\) is accepted. <br/><br/> The only thing we still need to do is find a good value for \(a^{\ast}\). To make the sampler as efficient as possible, we want to minimize the probability that we have to reject the age \(a\) initially sampled from the distribution corresponding to \(\tilde{S}\). This boils down to minimizing \[\int_0^{\infty} \tilde{S}(a)da = a_0 + \frac{1}{\lambda(a^{\ast})} = a^{\ast} + \frac{1 - \Lambda(a^{\ast})}{\lambda(a^{\ast})}\,. \] The derivative of \(a^{\ast} \mapsto \int_0^{\infty} \tilde{S}(a)da\) equals \[ \frac{(\Lambda(a^{\ast}) - 1)\lambda'(a^{\ast})}{\lambda(a^{\ast})^2} \] and thus, we find an extreme value for \(\int_0^{\infty} \tilde{S}(a)da\) when \(\Lambda(a^{\ast}) = 1\) or when \(\lambda'(a^{\ast}) = \frac{d\lambda}{da^{\ast}} = 0\). The second condition can only correspond to a very small \(a^{\ast}\), and therefore will not minimize \(\int_0^{\infty} \tilde{S}(a)da\). Hence, we have to solve \(\Lambda(a^{\ast}) = 1\). 
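Putting the steps above together, here is a Python sketch of the rejection sampler (my own illustration, not the GSL version promised earlier), using the parameter values from the text and solving \(\Lambda(a^{\ast}) = 1\) numerically with scipy. Instead of explicitly inverting the primitive of \(\tilde{S}\), it draws from the equivalent uniform-plus-exponential mixture:

```python
import numpy as np
from scipy.optimize import brentq

## parameters from the text
u1, v1, u2, v2 = 0.1, -10.5, -0.4, -8.0

def lam(a):   ## hazard lambda(a)
    return np.exp(u1 * a + v1) + np.exp(u2 * a + v2)

def Lam(a):   ## cumulative hazard Lambda(a)
    return (np.exp(v1) / u1 * (np.exp(u1 * a) - 1.0)
            + np.exp(v2) / u2 * (np.exp(u2 * a) - 1.0))

## choose a* by solving Lambda(a*) = 1, and a0 such that the
## comparison cumulative hazard is tangent to Lambda at a*
astar = brentq(lambda a: Lam(a) - 1.0, 0.0, 200.0)
a0 = astar - Lam(astar) / lam(astar)
rate = lam(astar)

def sample_age(rng):
    """Rejection sampler for the equilibrium age distribution,
    with density proportional to S(a) = exp(-Lambda(a))."""
    Z = a0 + 1.0 / rate  ## integral of the dominating function S-tilde
    while True:
        ## (i) sample from the density ~ S-tilde: a uniform piece on
        ## [0, a0] plus an exponential tail with rate lambda(a*)
        if rng.uniform() < a0 / Z:
            a = rng.uniform(0.0, a0)
        else:
            a = a0 + rng.exponential(1.0 / rate)
        ## (ii) acceptance ratio r = S(a) / S-tilde(a) <= 1
        Lam_tilde = 0.0 if a < a0 else rate * (a - a0)
        r = np.exp(Lam_tilde - Lam(a))
        ## (iii)-(iv) accept or reject
        if rng.uniform() <= r:
            return a

rng = np.random.default_rng(1)
ages = [sample_age(rng) for _ in range(2000)]
```

Because \(\tilde{S}\) hugs \(S\) closely for these parameter values, rejections should be rare and the loop seldom repeats.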
When we ignore the second term of \(\Lambda\), we find the approximate solution \[ a^{\ast} \approx \log(1 + u_1 e^{-v_1})/u_1\,.\] <div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-rTPrU1Cleqo/VkhkQzmc2aI/AAAAAAAACMg/ZNPJr0DiQ_g/s1600/carnes-dom.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-rTPrU1Cleqo/VkhkQzmc2aI/AAAAAAAACMg/ZNPJr0DiQ_g/s320/carnes-dom.png" /></a></div><br/></div>Christiaan van Dorphttps://plus.google.com/104659278771258804795noreply@blogger.com0tag:blogger.com,1999:blog-8385427779430646568.post-88715191374758001272015-09-18T03:17:00.001-07:002016-05-02T11:58:01.862-07:00Basic statistics using higher-order functions in C++<div dir="ltr" style="text-align: left;" trbidi="on">I do a lot of individual-based simulations, and often I want to calculate some population statistics 'on the fly'. I found that it can be helpful to use C/C++'s basic functional capabilities. <br /><br />A higher-order function is a function that takes other functions as arguments (or returns a function as a result). In C/C++, functions can be passed to other functions, but the notation can be a bit awkward, as one needs to pass a pointer to the function. If the function is a method of some class, then the notation gets even more involved. You can make your life easier by using a typedef.<br /><br />The following code snippet shows the header file of a simple example. The goal is to calculate some statistics on a population of "Critters". These Critters have multiple "traits", and the traits are accessed by methods of the class Critter of signature "double Critter::trait() const". Suppose that we want to calculate the mean of the trait "happiness". It's trivial to write a function that does this, but then we might also get interested in the average "wealth". 
The function that calculates the average wealth is identical to the function that calculates average happiness, except that happiness is replaced by wealth. We can get rid of this redundancy by defining the typedef Trait as a method of Critter, and writing a function that takes the average of an <i>arbitrary</i> Trait.<br /><br /><script src="https://gist.github.com/chvandorp/e765e161d37e76ad4493.js"></script>Let us now look at the source file. The most important things to notice are... <br />(1) whenever you pass a member "trait" (e.g. wealth) of Critter to a function, you should pass it as "&Critter::trait" (i.e. pass a pointer to the member).<br />(2) when you want to evaluate Trait x of Critter alice, you'll need to de-reference x, and call the resulting function: "(alice.*x)()" <br /><br /><script src="https://gist.github.com/chvandorp/f2667a5435daa2210b32.js"></script>If you want to play with this example, put the header in a file called "main.hpp", and the source in a file called "main.cpp", and compile main.cpp by typing "g++ main.cpp -o critters" in your terminal (I assume that you are using Linux and have the gcc compiler installed). <br /></div>Christiaan van Dorphttps://plus.google.com/104659278771258804795noreply@blogger.com0tag:blogger.com,1999:blog-8385427779430646568.post-67419507366385682152015-07-10T07:20:00.001-07:002016-05-02T11:57:31.245-07:00Blob plots, like violin plots, but more preciseViolin plots are beautiful, useful and much clearer than drawing several histograms on the same plot.<br />In my opinion, the smoothing that the Python implementation applies does not really compensate for the loss of precision with its visual appeal.<br />Why not plot the whole distribution, then?<br /><br />I modified the violin plot source code that I found <a href="http://pyinsci.blogspot.nl/2009/09/violin-plot-with-matplotlib.html">HERE</a> <br /><br />And produced... 
let's call them blob plots:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-YSAh0mE6NBM/VZ_T6fK551I/AAAAAAAAAMo/TIwA5j7JG3Y/s1600/blobplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://4.bp.blogspot.com/-YSAh0mE6NBM/VZ_T6fK551I/AAAAAAAAAMo/TIwA5j7JG3Y/s400/blobplot.png" width="400" /></a></div><br />Here is the code.<br /><br /><script src="https://gist.github.com/chvandorp/a436afffa5d7fe8f18fd.js"></script>Feel free to take it and do whatever you want with it.<br />If you have ideas to improve it, let me know!<br />If you like the result, go say thanks to the violin plot maker (I did very little).Enrico Sandro Colizzinoreply@blogger.com0tag:blogger.com,1999:blog-8385427779430646568.post-8411271404113128472015-06-26T07:32:00.000-07:002016-05-02T11:58:01.859-07:00A simple class wrapper for parallelism in C++<div dir="ltr" style="text-align: left;" trbidi="on">Concurrency can be extremely complicated, and causes problems that will <a href="https://en.wikipedia.org/wiki/Concurrency_(computer_science)">haunt you in your dreams</a>. The classical libraries in C/C++ don't protect you from this horror in any way, and you will have to figure it out yourself. Parallelism is supposed to be a lot easier, but C/C++ does not have standard libraries like---for instance---Python's parallelism modules. The <a class="wiki_link_ext" href="http://www.boost.org/" rel="nofollow">boost</a> libraries, however, provide an extensive interface for concurrency and parallelism.<br />If you don't want to use boost, don't panic. There are other options. The POSIX <a class="wiki_link_ext" href="https://en.wikipedia.org/wiki/POSIX_Threads" rel="nofollow">pthread</a> library provides a down-to-the-metal concurrency model. It took me a while to find out what it is all about, and I haven't successfully applied all its possibilities. 
What I have managed to apply is a so-called "worker-pool" model. This is one of the easier concurrency applications (it falls under the parallelism category) of the pthread library, but can be quite useful. Here I will demonstrate a "wrapper" that C++ classes can inherit.<br /><br />Suppose that you have a class Fun that needs to do some computations. We declare Fun in---say---Fun.hpp: <br /><script src="https://gist.github.com/chvandorp/33b39d004d0026873a10.js"></script> Fun inherits the worker pool functionality from the class "WorkerPoolWrapper" (imported from WorkerPoolWrapper.hpp). The class WorkerPoolWrapper has a "pure virtual" member executeJob(void* ). You, the user, must specify what you want to compute in your own definition of the executeJob method. Besides implementing executeJob, you must also initialize the worker pool, and somewhere before Fun's destructor returns, the threads must be "joined", i.e., all worker threads must finish their execution. In this case, I use the constructor and destructor of Fun to accomplish these things: <br /><script src="https://gist.github.com/chvandorp/e9276effd64d38202f01.js"></script> The methods initWorkerPool and waitForWorkerPoolToExit are inherited from WorkerPoolWrapper. Let's use Fun to compute the number of primes pi(n) below a number n. We overload operator() as follows: <br /><script src="https://gist.github.com/chvandorp/3836891a5d9296c36782.js"></script> Notice that this implementation of pi(n) is not the most efficient one. It checks for every integer i between 0 and n whether i is prime or not. This prime test is performed by executeJob. In the background, WorkerPoolWrapper has a list of jobs that have to be executed. Jobs can be added to this job list using addNewJob(void* ). Once executed, the result of a job must somehow be stored in the job again. 
Above, the number pi is the sum of the array ns, which makes sense when we look at the implementation of executeJob: <br /><script src="https://gist.github.com/chvandorp/d97ded1320d7483bd0de.js"></script> Hence, executeJob transforms the number i pointed to by job into 0 if i is composite, or 1 if i is prime, such that the sum of these values equals pi(n). Before gathering the results in Fun::operator(), we call syncWorkerThreads(). This method blocks until every job in the job list has been executed. <br />Using the functionoid Fun now works as follows: <br /><script src="https://gist.github.com/chvandorp/8e7630e0252301976f8c.js"></script> The class WorkerPoolWrapper is declared here: <br /><script src="https://gist.github.com/chvandorp/2d1e15f49a442e5b5fbf.js"></script> And the members are defined here: <script src="https://gist.github.com/chvandorp/b54fb35ebdda96794df5.js"></script> The credits for combining pthread with C++ classes go to <a href="http://stackoverflow.com/users/131930/jeremy-friesner">Jeremy Friesner</a>. </div>Christiaan van Dorphttps://plus.google.com/104659278771258804795noreply@blogger.com0