tbz533: 2019

Friday 15 November 2019

My Matplotlib Cheatsheet (#1)

I find myself Googling the same matplotlib/pyplot queries over and over again. To make life easy, I started collecting some of them in a series of blog posts.
In this first post, I'll look at subplots and annotations.

Turn off axes

We can remove axes, ticks, borders, etc. with ax.axis('off'). This can be used to remove unnecessary subplots created with the subplots function from pyplot, for instance, when we only want to use the upper triangle of the grid. Here's an example:

The result is:

Share axes between subplots after plotting

An alternative way of making a ragged array of subplots makes use of gridspec. The problem here is that it is a bit more difficult to share x and y axes. Of course add_subplot has keyword arguments sharex and sharey, but then we have to distinguish between the first and subsequent subplots. A better solution is the get_shared_x_axes() method. Here's an example:

The result is:

Hybrid (or blended) transforms for annotations

When you annotate a point in a plot, the location of the text is often relative to the data in one coordinate, but relative to the axis (e.g. in the middle) in the other. I used to do this with inverse transforms, but it turns out that there is a better way: the blended_transform_factory function from matplotlib.transforms.
Suppose that I want to annotate three points in three subplots. The arrows should point to these three points, and I want the text to be located above the point, but the text in the 3 subplots has to be vertically aligned.

Notice that the y-axes are not shared between the subplots! To accomplish the alignment, we have to use the annotate method with a custom transform.

Monday 11 November 2019

Using constant Stan functions for parameter array indices in ODE models

Stan's ODE solver has a fixed signature. The state, parameters and data has to be given as arrays. This can get confusing during development of a model, since parameters can change to constants (i.e. data), the number of parameters can change and even the state variables could change. Errors can easily creep in since the user has to correctly change all indices in the ODE function, and in (e.g.) the model block. A nice way to solve this is using constant functions. Constants in Stan are functions without an argument that return the constant value. For instance pi() returns 3.14159265... The user can define such constants in the functions block. Suppose that we want to fit the Lotka-Volterra model to data. The system of ODEs is given by \[ \frac{dx}{dt} = ax - bxy\,,\quad \frac{dy}{dt} = cbxy - dy \] and so we have a 2-dimenional state, and we need a parameter vector of length 4. In the function block, we will define a function int idx_a() { return 1; } that returns the index of the parameter \(a\) in the parameter vector, and we define similar functions for the other parameters. The full model can be implemented in Stan as shown below. The data Preys and Predators is assumed to be Poisson-distributed with mean \(Kx\) and \(Ky\), respectively, for some large constant \(K\). I fitted the model to some randomly generated data, which resulted in figure above.

Of course, this is still a low-dimensional model with a small number of parameters, but I found that even in a simple model, defining parameter indices in this way keeps everything concise.
I used the following Python script to generate the data and interface with Stan.

The following Figure show the parameter estimates together with the "real" parameter values

Saturday 3 August 2019

Tracking Jupyter notebooks with Git

Recently, I came across a PLOS Ten Simple Rules paper about Jupyter notebooks. Rule number 6 is about version control and suggests the use of nbdime to merge notebooks, or saving the notebooks as python scripts and tracking those with your version control system. I experienced problems with using Git to track changes in Jupyter notebooks a while ago, and have been using the following solution.

For me, the largest nuisance is that figures are stored as very long strings in the output of cells. This makes it almost impossible to manually resolve conflicts while merging two branches. The solution is simple: just clear the output of the cells before committing the notebook. I'm doing this with a simple script (which I found somewhere on the Internet).

The script clear_output_ipynb.py lives in the same folder (called notebooks) as my Jupyter notebooks. I don't track changes in the .ipynb files, but have "clean" copies of the notebooks (with extension .ipynb.cln) that are part of the Git repository. To make life easy, I have two makefiles in my project folder called cln.makefile and nbs.makefile. Before I stage the changes in my notebooks, I first run

$ make -f cln.makefile

which runs the script clear_output_ipynb.py for each notebook in my notebooks folder.

After I pull changes from a remote repository, or switch to another branch, I have to copy all .ipynb.cln files to .ipynb files. For this I have another makefile, and so I run

$ make -f nbs.makefile

before using and modifying the notebooks.

Of course, sometimes I forget to clean the notebooks before committing, or I forget to make the .ipynb files. I've tried to automate the process of cleaning and copying with Git "hooks", but I have not been able to make that work. If somebody knows how, let me know!

Sunday 21 July 2019

Neat encoding of censored data in Stan

Suppose that some of the observations in your data set are left- or right-censored. A way to handle such data in likelihood-based models is to integrate out the censored observations. In that case you have to replace the likelihood of an observation given the parameters with a cumulative likelihood. In Stan, you can replace a sampling statement as X ~ normal(mu, sigma) with target += normal_lcdf(X | mu, sigma) when \(X\) is in fact the upper bound of a left-censored observation, that is \(\mathcal{N}(\mu, \sigma^2)\) distributed. For a right-censored observation you have to use the complementary cumulative distribution function normal_lccdf(X | mu, sigma).

A way to encode censoring in a Stan model such that all types of data can be treated equally is by using a custom probability distribution function censored_normal_lpdf that takes an additional argument cc that determines the type of censoring. Below we use the following codes:

Uncensored	0
Left-censored	1
Right-censored	2
Missing data	3

The function censored_normal_lpdf is defined in the functions of the Stan model below. In addition to N Observations, we add N CensorCodes that must be either 0, 1, 2, or 3 in the data block. In the model block, we now can use the sampling statement Observations[n] ~ censored_normal(mu, sigma, CensorCodes[n]), because this is just syntactic sugar for target += censored_normal_lpdf(Observations[n] | mu, sigma, CensorCodes[n]). Vectorization unfortunately does not work with this method.

To demonstrate the censored_normal distribution, I added the following python script that generates some random data and compiles and runs the Stan model.

The output of the script will be something like

mu: -0.018, 95% CrI: [-0.176, 0.133]
sigma: 1.071, 95% CrI: [0.953, 1.203]

Saturday 18 May 2019

A thread-safe global random number generator using C++ <random> and <mutex>

In a previous post, Sandro showed how to use the C++ <random> header to define a global random number generator (RNG). If at some point the user decides to parallelize their program, they would have to make the global RNG thread-safe. This requires only a small modification.

I have copied Sandro's code and added thread-safety using the C++ (with standard at least c++11) headers <thread> and <mutex>. In the header file thread_safe_random.hpp, the functions RANDOM and Seed are declared.

The C++ file thread_safe_random.cpp defines the functions RANDOM and Seed, and the RNG and uniform distribution as before. Additionally, a mutex my_rng_mutex is defined to guard my_rng. I am using a lock_guard to lock and release the mutex. When one thread of execution calls the function RANDOM, it acquires the mutex. Any other thread that calls RANDOM, has to wait until the mutex is released.

In order to demonstrate thread_safe_random, I created max threads in the main function that use the auxiliary function fetch_random_number to call RANDOM.

The result should look like

$ g++ --std=c++11 -pthread main.cpp thread_safe_random.cpp -o test_safe_random
$ ./test_safe_random
0.284779
0.243487
0.161906
0.338338
0.235765
0.502853
0.389262
0.165401
0.244871
0.194046

However, the order of these numbers can change each time you execute the program. This means that the program is no longer deterministic (although we can argue about what it means to be deterministic or not), because the OS determines the order in which the threads call RANDOM. Another problem with this implementation is that it will be slow when each thread needs many random numbers.

Friday 17 May 2019

Copying polymorphic C++ objects using an inherited dup() method

In order to copy polymorphic objects in C++, it can be convenient to equip the base and derived classes with a .dup() method (or .clone() method) that returns a pointer to a copy of the (derived) object. When you have a large amount of different derived classes, overriding the base class's .dup() method for each of them can be a bit of a nuisance. In order to solve this, I sometimes use an "intermediate" class template that can be inherited instead of the base class to provide the .dup() method. This solution is not perfect, because it does not provide the possibility of covariant return types.

The class template Cloneable is defined as follows:

In the following code snippet, the use of the Cloneable is demonstrated:
If someone has a better way to do this, let me know.

tbz533