Sunday, 21 July 2019

Neat encoding of censored data in Stan

Suppose that some of the observations in your data set are left- or right-censored. A way to handle such data in likelihood-based models is to integrate out the censored observations. In that case you have to replace the likelihood of an observation given the parameters with a cumulative likelihood. In Stan, you can replace a sampling statement as X ~ normal(mu, sigma) with target += normal_lcdf(X | mu, sigma) when \(X\) is in fact the upper bound of a left-censored observation, that is \(\mathcal{N}(\mu, \sigma^2)\) distributed. For a right-censored observation you have to use the complementary cumulative distribution function normal_lccdf(X | mu, sigma).

A way to encode censoring in a Stan model such that all types of data can be treated equally is by using a custom probability distribution function censored_normal_lpdf that takes an additional argument cc that determines the type of censoring. Below we use the following codes:
Uncensored 0
Left-censored 1
Right-censored 2
Missing data 3

The function censored_normal_lpdf is defined in the functions of the Stan model below. In addition to N Observations, we add N CensorCodes that must be either 0, 1, 2, or 3 in the data block. In the model block, we now can use the sampling statement Observations[n] ~ censored_normal(mu, sigma, CensorCodes[n]), because this is just syntactic sugar for target += censored_normal_lpdf(Observations[n] | mu, sigma, CensorCodes[n]). Vectorization unfortunately does not work with this method.


To demonstrate the censored_normal distribution, I added the following python script that generates some random data and compiles and runs the Stan model.


The output of the script will be something like
mu: -0.018, 95% CrI: [-0.176, 0.133]
sigma: 1.071, 95% CrI: [0.953, 1.203]