In a paper by Carnes et. al, a simple parametric, but realistic life span distribution is given, and here I show how you can sample from it.
In addition, assuming a demographic equilibrium, the age of individuals will have a particular distribution. I will show what this distribution is, and again how to sample from it. Sampling ages instead of lifespans might be useful for initiating simulations. I model epidemics, and I want my disease free (a.k.a. virgin) population to have the 'right' age distribution.
The life span distribution has hazard λ(a)=eu1a+v1+eu2a+v2. Typical parameters are given by u1=0.1, v1=−10.5, u2=−0.4, and v2=−8, so that infants have a slightly increased hazard of dying, and after the age of 60, the hazard rapidly starts to grow, until it becomes exceedingly large around a=100.
The survival function S(a)=Pr(A>a), where A is the random variable denoting 'age at death' is given by S(a)=e−Λ(a), with Λ(a):=∫a0λ(a′)da′ denoting the cumulative hazard. The cumulative hazard Λ is easily calculated: Λ(a)=ev1u1(eu1a−1)+ev2u2(eu2a−1), but its inverse, or the inverse of the survival function is more difficult to calculate.
We need the inverse of S, because sampling random deviates typically involves uniformly sampling a number u∈[0,1). The number S−1(u) is then the desired deviate.
In a future post, I will show how to use the GNU Scientific Library (GSL) to sample deviates from A.
Suppose that the birth rate b in our population is constant. A PDE describing the population is given by ∂N(a,t)∂t+∂N(a,t)∂a=−λ(a)N(a,t), where N(a,t) is the number (density) of individuals of age a, alive at time t. The boundary condition (describing birth) is given by N(0,t)=b When we assume that the population is in a demographic equilibrium, the time derivative with respect to t vanishes, and we get an ODE for the age distribution: ∂N(a)∂a=−λ(a)N(a),N(0)=b, where we omitted the variable t. This equation can be solved: 1N∂N∂a=∂log(N)∂a=−λ(a)⟹N(a)=c⋅e−Λ(a) for some constant c. Since b=N(0)=c⋅e−Λ(0)=c, we have N(a)=b⋅e−Λ(a). Hence, we now know the PDF of the age distribution (up to a constant). Unfortunately, we can't get a closed form formula for the CDF, let alone invert it. Therefore, when we want to sample, we need another trick. I've used a method from Numerical Recipies in C. It involves finding a dominating function of the PDF, with an easy, and easily invertible, primitive.
Let's just assume that b=1, so that the objective PDF is N(a)=e−Λ(a). Please notice that N is not a proper PDF, since, in general, the expectation won't be 1. We need to find a simple, dominating function for N. A stepwise defined function might be a good choice, since the hazard is practically zero when the age is below 50, and then increases rapidly. We first find a comparison cumulative hazard ˜Λ that is dominated by the actual cumulative hazard Λ. Many choices are possible, but one can take for instance ˜Λ(a)={0if a<a0λ(a∗)⋅(a−a0)otherwise where a0=a∗−Λ(a∗)λ(a∗). The constant a0 is chosen such that the cumulative hazards Λ and ˜Λ are tangent at a∗.
Since Λ dominates ˜Λ, the survival function ˜S defined by ˜S(a)=e−˜Λ(a) dominates S. It is easy to find the inverse of a↦∫a0˜S(a′)da′, and hence we can easily sample random deviates from the age distribution corresponding to ˜Λ. In order to sample from the desired age distribution, one can use a rejection method: (i) sample an age a from the easy age distribution. (ii) compute the ratio r=S(a)/˜S(a)≤1. (iii) sample a deviate u∼Uniform(0,1). (iv) accept the age a when u≤r, and reject a otherwise. Repeat these steps until an age a was accepted.
The only thing we still need to do, is to find a good value for a∗. To make the sampler as efficient as possible, we want to minimize the probability that we have to reject the initially sampled age a from S. This boils down to minimizing ∫∞0˜S(a)da=a0+1λ(a∗)=a∗+1−Λ(a∗)λ(a∗). The derivative of a∗↦∫∞0˜S(a)da equals (Λ(a∗)−1)λ′(a∗)λ(a∗)2 and thus, we find an extreme value for ∫∞0˜S(a)da when Λ(a∗)=1 or when λ′(a∗)=dλda∗=0. The second condition can only correspond to a very small a∗, and therefore will not minimize ∫∞0˜S(a)da. Hence, we have to solve Λ(a∗)=1. When we ignore the second term of Λ, we find that: a∗=log(1+u1exp(−v1))/u1
The life span distribution has hazard λ(a)=eu1a+v1+eu2a+v2. Typical parameters are given by u1=0.1, v1=−10.5, u2=−0.4, and v2=−8, so that infants have a slightly increased hazard of dying, and after the age of 60, the hazard rapidly starts to grow, until it becomes exceedingly large around a=100.
The survival function S(a)=Pr(A>a), where A is the random variable denoting 'age at death' is given by S(a)=e−Λ(a), with Λ(a):=∫a0λ(a′)da′ denoting the cumulative hazard. The cumulative hazard Λ is easily calculated: Λ(a)=ev1u1(eu1a−1)+ev2u2(eu2a−1), but its inverse, or the inverse of the survival function is more difficult to calculate.
We need the inverse of S, because sampling random deviates typically involves uniformly sampling a number u∈[0,1). The number S−1(u) is then the desired deviate.
In a future post, I will show how to use the GNU Scientific Library (GSL) to sample deviates from A.
Suppose that the birth rate b in our population is constant. A PDE describing the population is given by ∂N(a,t)∂t+∂N(a,t)∂a=−λ(a)N(a,t), where N(a,t) is the number (density) of individuals of age a, alive at time t. The boundary condition (describing birth) is given by N(0,t)=b When we assume that the population is in a demographic equilibrium, the time derivative with respect to t vanishes, and we get an ODE for the age distribution: ∂N(a)∂a=−λ(a)N(a),N(0)=b, where we omitted the variable t. This equation can be solved: 1N∂N∂a=∂log(N)∂a=−λ(a)⟹N(a)=c⋅e−Λ(a) for some constant c. Since b=N(0)=c⋅e−Λ(0)=c, we have N(a)=b⋅e−Λ(a). Hence, we now know the PDF of the age distribution (up to a constant). Unfortunately, we can't get a closed form formula for the CDF, let alone invert it. Therefore, when we want to sample, we need another trick. I've used a method from Numerical Recipies in C. It involves finding a dominating function of the PDF, with an easy, and easily invertible, primitive.
Let's just assume that b=1, so that the objective PDF is N(a)=e−Λ(a). Please notice that N is not a proper PDF, since, in general, the expectation won't be 1. We need to find a simple, dominating function for N. A stepwise defined function might be a good choice, since the hazard is practically zero when the age is below 50, and then increases rapidly. We first find a comparison cumulative hazard ˜Λ that is dominated by the actual cumulative hazard Λ. Many choices are possible, but one can take for instance ˜Λ(a)={0if a<a0λ(a∗)⋅(a−a0)otherwise where a0=a∗−Λ(a∗)λ(a∗). The constant a0 is chosen such that the cumulative hazards Λ and ˜Λ are tangent at a∗.
Since Λ dominates ˜Λ, the survival function ˜S defined by ˜S(a)=e−˜Λ(a) dominates S. It is easy to find the inverse of a↦∫a0˜S(a′)da′, and hence we can easily sample random deviates from the age distribution corresponding to ˜Λ. In order to sample from the desired age distribution, one can use a rejection method: (i) sample an age a from the easy age distribution. (ii) compute the ratio r=S(a)/˜S(a)≤1. (iii) sample a deviate u∼Uniform(0,1). (iv) accept the age a when u≤r, and reject a otherwise. Repeat these steps until an age a was accepted.
The only thing we still need to do, is to find a good value for a∗. To make the sampler as efficient as possible, we want to minimize the probability that we have to reject the initially sampled age a from S. This boils down to minimizing ∫∞0˜S(a)da=a0+1λ(a∗)=a∗+1−Λ(a∗)λ(a∗). The derivative of a∗↦∫∞0˜S(a)da equals (Λ(a∗)−1)λ′(a∗)λ(a∗)2 and thus, we find an extreme value for ∫∞0˜S(a)da when Λ(a∗)=1 or when λ′(a∗)=dλda∗=0. The second condition can only correspond to a very small a∗, and therefore will not minimize ∫∞0˜S(a)da. Hence, we have to solve Λ(a∗)=1. When we ignore the second term of Λ, we find that: a∗=log(1+u1exp(−v1))/u1