Processing math: 100%

Sunday, 21 July 2019

Neat encoding of censored data in Stan

Suppose that some of the observations in your data set are left- or right-censored. A way to handle such data in likelihood-based models is to integrate out the censored observations. In that case you have to replace the likelihood of an observation given the parameters with a cumulative likelihood. In Stan, you can replace a sampling statement as X ~ normal(mu, sigma) with target += normal_lcdf(X | mu, sigma) when X is in fact the upper bound of a left-censored observation, that is N(μ,σ2) distributed. For a right-censored observation you have to use the complementary cumulative distribution function normal_lccdf(X | mu, sigma).

A way to encode censoring in a Stan model such that all types of data can be treated equally is by using a custom probability distribution function censored_normal_lpdf that takes an additional argument cc that determines the type of censoring. Below we use the following codes:
Uncensored 0
Left-censored 1
Right-censored 2
Missing data 3

The function censored_normal_lpdf is defined in the functions of the Stan model below. In addition to N Observations, we add N CensorCodes that must be either 0, 1, 2, or 3 in the data block. In the model block, we now can use the sampling statement Observations[n] ~ censored_normal(mu, sigma, CensorCodes[n]), because this is just syntactic sugar for target += censored_normal_lpdf(Observations[n] | mu, sigma, CensorCodes[n]). Vectorization unfortunately does not work with this method.

functions {
real censored_normal_lpdf(real x, real mu, real sigma, int cc) {
real ll;
if ( cc == 0 ) { // uncensored
ll = normal_lpdf(x | mu, sigma);
} else if ( cc == 1 ) { // left-censored
ll = normal_lcdf(x | mu, sigma);
} else if ( cc == 2 ) { // right-censored
ll = normal_lccdf(x | mu, sigma);
} else if ( cc == 3 ) { // missing data
ll = 0.0;
} else { // any other censoring code is invalid
reject("invalid censoring code in censored_normal_lpdf");
}
return ll;
}
}
data {
int<lower=0> N;
real Observations[N];
int<lower=0, upper=3> CensorCodes[N];
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
for ( n in 1:N ) {
Observations[n] ~ censored_normal(mu, sigma, CensorCodes[n]);
}
}

To demonstrate the censored_normal distribution, I added the following python script that generates some random data and compiles and runs the Stan model.

#! /usr/bin/python3
import numpy as np
import pystan
from scipy.stats import norm
## define the codes used for the type of censoring
uncensored_code = 0
left_censored_code = 1
right_censored_code = 2
missing_code = 3
## number of Observations
N = 200
## generate some (uncensored) data
UncensObs = norm.rvs(loc=0, scale=1, size=N)
## choose artificial "limits of detection"
LowerBound = -1
UpperBound = 2
## apply censoring accoring to the chosen bounds
def make_cens_obs(x):
if x < LowerBound:
return (LowerBound, left_censored_code)
elif x > UpperBound:
return (UpperBound, right_censored_code)
else:
return (x, uncensored_code)
ObsCensCodes = [make_cens_obs(x) for x in UncensObs]
## collect the data in a dictionary
data_dict = {
"N" : N,
"Observations" : [x[0] for x in ObsCensCodes],
"CensorCodes" : [x[1] for x in ObsCensCodes]
}
## compile the stan model
sm = pystan.StanModel(file="censored-normal.stan")
## sample from the posterior
sam = sm.sampling(data=data_dict)
chain_dict = sam.extract(permuted=True)
## look at the results
mu_est = np.mean(chain_dict["mu"])
mu_cri = np.percentile(chain_dict["mu"], [2.5, 97.5])
sigma_est = np.mean(chain_dict["sigma"])
sigma_cri = np.percentile(chain_dict["sigma"], [2.5, 97.5])
print(f"mu: {mu_est:0.3f}, 95% CrI: [{mu_cri[0]:0.3f}, {mu_cri[1]:0.3f}]")
print(f"sigma: {sigma_est:0.3f}, 95% CrI: [{sigma_cri[0]:0.3f}, {sigma_cri[1]:0.3f}]")

The output of the script will be something like
mu: -0.018, 95% CrI: [-0.176, 0.133]
sigma: 1.071, 95% CrI: [0.953, 1.203]