## Friday, 6 November 2015

### ETA for C++

Simulations can take some time, and I'd like to know how long. This is easy, right? Yes, it is. I've done it lots of times, but every time I do, I curse myself for not using an old piece of code.
most likely, there is some standard, best way of doing this, but I haven't found it. Most recently, I did this: I made a simple object "EtaEstimator", that can be updated every (costly) time step and asked for an estimated time of "arrival" at any time. Here's the header:
// eta.hpp
#include <ctime>
#include <cmath> // floor
#include <iostream>

class EtaEstimator {
public:
EtaEstimator(int );
// constuction starts the clock. Pass the number of steps
void update();
void print(std::ostream & ) const;
private:
double ct, etl; // cumulative time, estimated time left
int n, N; // steps taken, total amount of steps
clock_t tick; // time after update ((c) matlab)
// statics...
static const int secperday = 86400;
static const int secperhour = 3600;
static const int secperminute = 60;
};

std::ostream & operator<<(std::ostream & , const EtaEstimator & );

The members are straight forward too:
// eta.cpp
#include "eta.hpp"

EtaEstimator::EtaEstimator(int N) :
ct(0.0), etl(0.0), n(0), N(N) {
tick = clock();
}

void EtaEstimator::update() {
clock_t dt = clock() - tick;
tick += dt;
ct += (double(dt)/CLOCKS_PER_SEC); // prevent integer division
// CLOCKS_PER_SEC is defined in ctime
++n;
etl = (ct/n) * (N-n);
}

void EtaEstimator::print(std::ostream & os) const {
double etlprime = etl;
int days = floor(etlprime / secperday);
etlprime -= days * secperday;
int hours = floor(etlprime / secperhour);
etlprime -= hours * secperhour;
int minutes = floor(etlprime / secperminute);
etlprime -= minutes * secperminute;
int seconds = floor(etlprime);
os << (days > 0 ? std::to_string(days) + " " : "")
<< hours << ":"
<< (minutes < 10 ? "0" : "") << minutes << ":"
<< (seconds < 10 ? "0" : "") << seconds;
}

std::ostream & operator<<(std::ostream & os,
const EtaEstimator & eta) {
eta.print(os);
return os;
}

Typical usage of EtaEstimator would be the following:
#include <iostream>
#include "eta.hpp"

// about to do lots of work...
int N = 1000;
EtaEstimator eta(N);
for ( int n = 0; n < N; ++n ) {
// do something very expensive
eta.update()
std::cout << "\rETA: " << eta << std::flush;
}
// ...

PS: std::to_string is a C++11 feature, and can be ignored by using something like
if ( days > 0 ) os << days << " "; // else nothing at all


## Monday, 26 October 2015

### Carnes's life span distribution

In a paper by Carnes et. al, a simple parametric, but realistic life span distribution is given, and here I show how you can sample from it. In addition, assuming a demographic equilibrium, the age of individuals will have a particular distribution. I will show what this distribution is, and again how to sample from it. Sampling ages instead of lifespans might be useful for initiating simulations. I model epidemics, and I want my disease free (a.k.a. virgin) population to have the 'right' age distribution.
The life span distribution has hazard $\lambda(a) = e^{u_1 a + v_1} + e^{u_2 a + v_2}\,.$ Typical parameters are given by $$u_1 = 0.1$$, $$v_1 = -10.5$$, $$u_2 = -0.4$$, and $$v_2 = -8$$, so that infants have a slightly increased hazard of dying, and after the age of 60, the hazard rapidly starts to grow, until it becomes exceedingly large around $$a = 100$$.
The survival function $$S(a) = {\rm Pr}(A > a)$$, where $$A$$ is the random variable denoting 'age at death' is given by $$S(a) = e^{-\Lambda(a)}$$, with $$\Lambda(a) := \int_0^a \lambda(a')da'$$ denoting the cumulative hazard. The cumulative hazard $$\Lambda$$ is easily calculated: $\Lambda(a) = \frac{e^{v_1}}{u_1}(e^{u_1 a}-1) + \frac{e^{v_2}}{u_2}(e^{u_2 a}-1)\,,$ but its inverse, or the inverse of the survival function is more difficult to calculate.
We need the inverse of $$S$$, because sampling random deviates typically involves uniformly sampling a number $$u\in [0,1)$$. The number $$S^{-1}(u)$$ is then the desired deviate.

In a future post, I will show how to use the GNU Scientific Library (GSL) to sample deviates from $$A$$.

Suppose that the birth rate $$b$$ in our population is constant. A PDE describing the population is given by $\frac{\partial N(a,t)}{\partial t} + \frac{\partial N(a,t)}{\partial a} = -\lambda(a)N(a,t)\,,$ where $$N(a,t)$$ is the number (density) of individuals of age $$a$$, alive at time $$t$$. The boundary condition (describing birth) is given by $N(0,t) = b$ When we assume that the population is in a demographic equilibrium, the time derivative with respect to $$t$$ vanishes, and we get an ODE for the age distribution: $\frac{\partial N(a)}{\partial a} = -\lambda(a) N(a)\,,\quad N(0) = b\,,$ where we omitted the variable $$t$$. This equation can be solved: $\frac{1}{N}\frac{\partial N}{\partial a} = \frac{\partial \log(N)}{\partial a} = -\lambda(a) \implies N(a) = c \cdot e^{-\Lambda(a)}$ for some constant $$c$$. Since $$b = N(0) = c \cdot e^{-\Lambda(0)} = c$$, we have $N(a) = b\cdot e^{-\Lambda(a)}\,.$ Hence, we now know the PDF of the age distribution (up to a constant). Unfortunately, we can't get a closed form formula for the CDF, let alone invert it. Therefore, when we want to sample, we need another trick. I've used a method from Numerical Recipies in C. It involves finding a dominating function of the PDF, with an easy, and easily invertible, primitive.
Let's just assume that $$b = 1$$, so that the objective PDF is $$N(a) = e^{-\Lambda(a)}$$. Please notice that $$N$$ is not a proper PDF, since, in general, the expectation won't be $$1$$. We need to find a simple, dominating function for $$N$$. A stepwise defined function might be a good choice, since the hazard is practically zero when the age is below $$50$$, and then increases rapidly. We first find a comparison cumulative hazard $$\tilde{\Lambda}$$ that is dominated by the actual cumulative hazard $$\Lambda$$. Many choices are possible, but one can take for instance $\tilde{\Lambda}(a) = \left\{ \begin{array}{ll} 0 & \mbox{if } a < a_0 \\ \lambda(a^{\ast})\cdot (a-a_0) & \mbox{otherwise} \end{array}\right.$ where $a_0 = a^{\ast} - \frac{\Lambda(a^{\ast})}{\lambda(a^{\ast})}\,.$ The constant $$a_0$$ is chosen such that the cumulative hazards $$\Lambda$$ and $$\tilde{\Lambda}$$ are tangent at $$a^{\ast}$$.
Since $$\Lambda$$ dominates $$\tilde{\Lambda}$$, the survival function $$\tilde{S}$$ defined by $$\tilde{S}(a) = e^{-\tilde{\Lambda}(a)}$$ dominates $$S$$. It is easy to find the inverse of $$a\mapsto\int_0^a \tilde{S}(a')da'$$, and hence we can easily sample random deviates from the age distribution corresponding to $$\tilde{\Lambda}$$. In order to sample from the desired age distribution, one can use a rejection method: (i) sample an age $$a$$ from the easy age distribution. (ii) compute the ratio $$r = S(a)/\tilde{S}(a) \leq 1$$. (iii) sample a deviate $$u \sim \mbox{Uniform}(0,1)$$. (iv) accept the age $$a$$ when $$u \leq r$$, and reject $$a$$ otherwise. Repeat these steps until an age $$a$$ was accepted.

The only thing we still need to do, is to find a good value for $$a^{\ast}$$. To make the sampler as efficient as possible, we want to minimize the probability that we have to reject the initially sampled age $$a$$ from $$S$$. This boils down to minimizing $\int_0^{\infty} \tilde{S}(a)da = a_0 + \frac{1}{\lambda(a^{\ast})} = a^{\ast} + \frac{1 - \Lambda(a^{\ast})}{\lambda(a^{\ast})}\,.$ The derivative of $$a^{\ast} \mapsto \int_0^{\infty} \tilde{S}(a)da$$ equals $\frac{(\Lambda(a^{\ast}) - 1)\lambda'(a^{\ast})}{\lambda(a^{\ast})^2}$ and thus, we find an extreme value for $$\int_0^{\infty} \tilde{S}(a)da$$ when $$\Lambda(a^{\ast}) = 1$$ or when $$\lambda'(a^{\ast}) = \frac{d\lambda}{da^{\ast}} = 0$$. The second condition can only correspond to a very small $$a^{\ast}$$, and therefore will not minimize $$\int_0^{\infty} \tilde{S}(a)da$$. Hence, we have to solve $$\Lambda(a^{\ast}) = 1$$. When we ignore the second term of $$\Lambda$$, we find that: $a^{\ast} = \log(1 + u_1 \exp(-v_1))/u_1$

## Friday, 18 September 2015

### Basic statistics using higher-order functions in C++

I do a lot of individual-based simulations, and often I want to calculate some population statistics 'on the fly'. I found that it can be helpful to use C/C++'s basic functional capabilities.

A higher-order function is a function that takes other functions as arguments (or returns a function as a result). In C/C++, functions can be passed to other functions, but the notation can be a bit awkward, as one needs to pass a reference to a function. If the function is a method of some class, then the notation gets even more involved. You can make your life easier by using a typedef.

The following code snippet shows the header file of a simple example. The goal is to calculate some statistics on a population of "Critters". These Critters have multiple "traits", and the traits are accessed by methods of the class Critter of signature "double Critter::trait() const". Suppose that we want to calculate the mean of the trait "happiness". It's trivial to write a function that does this, but then we might also get interested in the average "wealth". The function that calculates the average wealth is identical to the function that calculates average happiness, except that happiness is replaced by wealth. We can get rid of this redundancy by defining the typedef Trait as a method of Critter, and writing a function that takes the average of an arbitrary Trait.

Let us now look at the source file. The most important things to notice are...
(1) whenever you pass a member "trait" (e.g. wealth) of Critter to a function, you should pass it as "&Critter::trait" (i.e. pass the reference).
(2) when you want to evaluate Trait x of Critter alice, you'll need to de-reference x, and call the resulting function: "(alice.*x)()"

If you want to play with this example, put the header in a file called "main.hpp", and the source in a file called "main.cpp", and compile main.cpp by typing "g++ main.cpp -o critters" in your terminal (I assume that you are using Linux and have the gcc compiler installed).

## Friday, 10 July 2015

### Blob plots, like violin plots, but more precise

Violin plots are beautiful, useful and much clearer than drawing several histograms on the same plot.
In my opinion, the smoothing function that they implement with python does not really compensate the loss of precision with the visual appealing.
Why not plotting the whole distribution, then?

I modified the violin plots source code that I found here HERE

And produced... let's called them blob plots:

Here is the code.

Feel free to take it and do whatever you want with it.
If you have ideas to improve it let me know!
If you like the result, go say thanks to the violin plot maker (I did very little)

## Friday, 26 June 2015

### A simple class wrapper for parallelism in C++

Concurrency can be extremely complicated, and causes problems that will haunt you in your dreams. The classical libraries in C/C++ don't protect you from this horror in any way, and you will have to figure it out yourself. Parallelism is supposed to be a lot easier, but C/C++ does not have standard libraries like---for instance---Pythons parallelism modules. The boost libraries, however, provide an extensive interface for concurrency and parallelism.
If you don't want to use boost, don't panic. There are other options. The POSIX pthread library provides a down-to-the-metal concurrency model. It took me a while to find out what it is all about, and I haven't successfully applied all it's possibilities. What I have managed to apply, is a so-called "worker-pool" model. This is one of the easier concurrency applications (it falls under the parallelism category) of the pthread library, but can be quite useful. Here I will demonstrate a "wrapper" that C++ classes can inherit.

Suppose that you have a class Fun, that needs to do some computations. We declare Fun in---say---Fun.hpp:
Fun inherits the worker pool functionality from the class "WorkerPoolWrapper" (imported from WorkerPoolWrapper.hpp). The class WorkerPoolWrapper has a "pure virtual" member executeJob(void* ). You, the user, must specify what you want to compute in your own definition of the executeJob method. Besides implementing executeJob, you must also initiate the worker pool, and somewhere before Fun's destuctor returns, the threads must be "joined", i.e., all worker threads must finish their execution. In this case, I use the constructor and destructor of Fun to accomplish these things:
The methods initWorkerPool and waitForWorkerPoolToExit are inherited from WorkerPoolWrapper. Lets use Fun to compute the number of primes pi(n) below a number n. We overload operator() as follows:
Notice that this implementation of pi(n) is not the most efficient one. It checks for every integer i between 0 and n whether i is prime or not. This prime test is performed by executeJob. In the background, WorkerPoolWrapper has a list of jobs that have to be executed. Jobs can be added to this job list using addNewJob(void* ). Once executed, the result of a job must somehow be stored in the job again. Above, the number pi is the sum of the array ns, which makes sense when we look at the implementation of executeJob:
Hence, executeJob transforms the number i pointed to by job into 0 if i is composit, or 1 if i is prime, such that the sum of the i's equals pi(n). Before we gathered the results in Fun::operator(), we called syncWorkerThreads(). This method lets the program halt until every job in the job list has been executed.
Using the functionoid Fun now works as follows:
The class WorkerPoolWrapper is declared here:
And the members are defined here: The credits for combining pthread with C++ classes go to Jeremy Friesner.