#include
files.
This is going to be my first pull request, so mistakes will be made.
I based all this on tutorials from the Stan Math website (specifically the sections on "adding new functions" and "adding new distributions"), and the stanc webpage. The first link explains how to add functions to the Stan Math library, and the second explains how to make such functions available in the Stan language.
Some updates about the submission process:
2023-12-09: After trying three times, I managed to pass all automated tests for the Stan Math library (see Steps 2 and 7 below). Now I'm waiting for code reviews from a Stan developer.
2023-12-23: Last week a Stan developer reviewed my code, made improvements and requested some changes. I discuss the improvements below (Step 7), as they are really instructive. My PR was then approved and merged into the "develop" branch. I then submitted PRs for the documentation and the Stan-to-C++ compiler. This went through without any issues. This means that the Dirichlet-multinomial distribution should be available in Stan version 2.34.
2024-01-15: I posted this blog post on the Stan forum and received some useful feedback. I've added some notes below.
All code and documentation published in this blog post is licensed under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Step 1: Coding the distribution as a user-defined function in Stan
A good place to start is to write a working Stan model that we can use for testing and benchmarking. The Dirichlet-Multinomial distribution is a Multinomial distribution in which the probability vector (simplex) is sampled from a Dirichlet distribution. We can actually use this fact to generate random samples in Stan. Let's start there, because it is easy. To define a custom RNG in Stan, you write a function in the
functions
block with name distribution_name_rng
where distribution_name
is the name of your distribution. In our case, that's dirichlet_multinomial
. The RNG function typically uses native Stan RNGs to generate random samples, which are then transformed, or rejected until they satisfy some condition, or whatever you need. In this case, we first sample a probability vector from the Dirichlet distribution using dirichlet_rng(alpha)
. Then, we use this probability vector as input for the Multinomial distribution to get an integer array with multinomial_rng(dirichlet_rng(alpha), N)
. Here N
is the "population size" parameter of the multinomial distribution (also referred to as the "number of trials").
Using some python code,
we can use the Stan model to sample random Dirichlet-Multinomial deviates \(X\) and plot \(X / N\) on a \(K\)-simplex. Here we generate \(40\,000\) samples using \(N = 1000\) and \(K = 3\).
Next, we define a probability mass function (PMF), which, like all probability functions in Stan, has to be defined on the log scale. The PMF of the Dirichlet-Multinomial distribution is \[ p(x | \alpha) = \frac{N B(\alpha_0, N)}{\prod_{k : x_k > 0} x_k B(\alpha_k, x_k)}\] where \(\alpha_0 = \sum_{k=1}^K \alpha_k \), \( N = \sum_{k=1}^K x_k \) and \(K\) is the number of categories. The function \(B\) is the Beta function. We can code this in Stan almost verbatim using the
lbeta
function, which computes \(\log \circ B\), and replacing products with sums.
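Inside the same functions block, the user-defined PMF looks something like this (a sketch, not necessarily the exact code I used):
real dirichlet_multinomial_lpmf(array[] int x, vector alpha) {
  real alpha_0 = sum(alpha);
  int N = sum(x);
  // log of the PMF given above, with products replaced by sums of lbeta terms
  real lp = log(N) + lbeta(alpha_0, N);
  for (k in 1:num_elements(x)) {
    if (x[k] > 0) {
      lp -= log(x[k]) + lbeta(alpha[k], x[k]);
    }
  }
  return lp;
}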
This implementation contains a small bug, as it does not work for \(x_1 = x_2 = \cdots = x_K = 0\), but we'll fix that later.
To do a simple test, I colored randomly sampled points (using \(\alpha = (2.4, 4.0, 6.0)'\) and \(N = 1000\)) scaled to fit in a 3-simplex in three ways. First I used the
scipy.stats
implementation of the Dirichlet-Multinomial PMF, then I used the dirichlet_multinomial_lpmf
function defined in the functions
block (and exponentiated the values).
Finally, I used a Gaussian KDE (from scipy.stats
) to compute an
empirical density of points in the sample (this is how I colored the points in the simplex plot above). The three methods of coloring
give very similar results.
Step 2: Adding the distribution to the Stan math library
Stan is built on top of the Stan math library. All native functions are defined here. We first have to clone the Stan math library from GitHub with git clone git@github.com:stan-dev/math.git
. Documentation for the Stan math library can be found here.
It shows you how to write programs that use the math library, giving you access to automatic differentiation, Stan functions, etc.
The distributions can be found in the folder math/stan/math/prim/prob
. Here we want to add the files dirichlet_multinomial_rng.hpp
for the random number generator, dirichlet_multinomial_lpmf.hpp
for the probability mass function and dirichlet_multinomial_log.hpp
which is only there for legacy reasons. In the same folder, we also find the Dirichlet distribution, the Multinomial distribution and the Beta-Binomial distribution.
If we somehow combine these three, we should get our Dirichlet-Multinomial distribution, so these examples are very useful.
Let's start with the RNG, which is easiest.
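Here is roughly what dirichlet_multinomial_rng.hpp looks like (a condensed sketch with a shortened docstring; the exact include paths and error messages may differ from the merged version):
#ifndef STAN_MATH_PRIM_PROB_DIRICHLET_MULTINOMIAL_RNG_HPP
#define STAN_MATH_PRIM_PROB_DIRICHLET_MULTINOMIAL_RNG_HPP

#include <stan/math/prim/meta.hpp>
#include <stan/math/prim/err.hpp>
#include <stan/math/prim/fun/to_ref.hpp>
#include <stan/math/prim/prob/dirichlet_rng.hpp>
#include <stan/math/prim/prob/multinomial_rng.hpp>
#include <vector>

namespace stan {
namespace math {

/**
 * Return a draw from a Dirichlet-Multinomial distribution with
 * prior size parameter \f$ \alpha \f$ and number of trials \f$ N \f$.
 */
template <class RNG>
inline std::vector<int> dirichlet_multinomial_rng(
    const Eigen::Matrix<double, Eigen::Dynamic, 1>& alpha, int N, RNG& rng) {
  static constexpr const char* function = "dirichlet_multinomial_rng";
  const auto& alpha_ref = to_ref(alpha);
  check_positive_finite(function, "prior size parameter", alpha_ref);
  check_nonnegative(function, "number of trials variables", N);

  if (N == 0) {
    // with zero trials, the only possible outcome is the zero vector
    return std::vector<int>(alpha.size(), 0);
  }
  return multinomial_rng(dirichlet_rng(alpha_ref, rng), N, rng);
}

}  // namespace math
}  // namespace stan
#endif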
We'll talk a bit about the docstring later, so let's ignore that for now. The actual logic of the code is really simple: we make use of the Dirichlet and Multinomial RNGs. Therefore, we have to
#include
their header files.
The difficulty is understanding some C++ tricks, error checking, and working with the Eigen library.
The first observation is that Stan math functions are defined in the namespace stan::math
.
Next, we see that the parameter alpha
has a somewhat complicated type
const Eigen::Matrix<double, Eigen::Dynamic, 1>& alpha
This means that alpha
is a \(K \times 1\) matrix (a vector) with double
entries.
\(K\) is not determined at compile time. The variable function
is used for generating error messages and
contains the name of the C++ function. The alpha_ref
variable is the result of the to_ref
function call. What this does is "evaluate" Eigen expressions. Eigen expressions can be "lazy" in the sense
that they are not immediately evaluated. In this case, alpha
could be the result of some
complicated vector computation. Here, we need the actual values of alpha. If alpha
is already
evaluated, then to_ref
does nothing.
Next, we validate the input. The variable
alpha
has to be positive, and not infinite.
There is also a check_positive
test, which does allow positive infinity. I guess \(+\infty\) could make
sense if only one of the \(\alpha_k\) is infinite. But if multiple \(\alpha_k\) are infinite, the distribution is no
longer well defined. So it's better to avoid infinities.
The integer N
has to be non-negative, and all integers are finite.
As mentioned, the logic is identical to the user-defined Stan function. There is one exception: the case where \(N = 0\). If we look at the RNG for the Multinomial distribution (v4.7.0, September 5, 2023), we find that \(N\) has to be positive:
check_positive(function, "number of trials variables", N);
and using this RNG with \(N=0\) indeed results in an exception:
Exception: multinomial_rng: number of trials variables is 0,
but must be positive!
On the other hand, the multinomial_lpmf
function works fine when \(N = \sum_{k=1}^K n_k = 0\), and returns \(0 = \log(1)\).
So I don't think the check_positive
check is correct. I might file an issue on GitHub in the future.
For now, we have to make sure that the case \(N = 0\) is handled separately by our RNG: we just return the zero vector.
Next, we are going to write the code for the probability mass function. The distribution is discrete, so we have to define a function that ends with
_lpmf
instead of _lpdf
.
If you read the example for "adding new distributions" on the Stan Math page,
you'll find that a lot of effort is put in computing the gradients of the function, and also making sure the function is
properly vectorized. The multinomial example that I based my implementation on did not do any of these things. It relies on automatic differentiation and is currently not vectorized. In fact, none of the multivariate discrete distributions are currently
vectorized. Therefore I did not try to implement vectorization for the Dirichlet-Multinomial either.
In the non-vectorized case, it also does not make much sense to manually compute derivatives. The log-PMF is just a sum of log-Beta functions.
I'll leave vectorization and gradients for a future PR.
Alas, as it turns out, the Stan team did not let me get away with this so easily. See section "Revisions" in "Step 7" below. Thankfully, I got a lot of help. Also, the version of the code with analytic derivatives is somewhat faster than my lazy version shown here.
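For reference, here is a sketch of that lazy version (close to, but not verbatim, what I first submitted):
#ifndef STAN_MATH_PRIM_PROB_DIRICHLET_MULTINOMIAL_LPMF_HPP
#define STAN_MATH_PRIM_PROB_DIRICHLET_MULTINOMIAL_LPMF_HPP

#include <stan/math/prim/meta.hpp>
#include <stan/math/prim/err.hpp>
#include <stan/math/prim/fun/lbeta.hpp>
#include <stan/math/prim/fun/to_ref.hpp>
#include <cmath>
#include <vector>

namespace stan {
namespace math {

template <bool propto, typename T_prior_size,
          require_eigen_col_vector_t<T_prior_size>* = nullptr>
return_type_t<T_prior_size> dirichlet_multinomial_lpmf(
    const std::vector<int>& ns, const T_prior_size& alpha) {
  static constexpr const char* function = "dirichlet_multinomial_lpmf";
  const auto& alpha_ref = to_ref(alpha);
  check_positive_finite(function, "prior size parameter", alpha_ref);
  check_nonnegative(function, "count", ns);
  check_size_match(function, "counts", ns.size(),
                   "prior size parameter", alpha_ref.size());

  return_type_t<T_prior_size> lp(0.0);
  if (!include_summand<propto, T_prior_size>::value) {
    return lp;
  }

  // terms that involve alpha, and the total count N = sum(ns)
  int sum_ns = 0;
  for (size_t i = 0; i < ns.size(); ++i) {
    if (ns[i] > 0) {
      lp -= lbeta(alpha_ref.coeff(i), ns[i]);
    }
    sum_ns += ns[i];
  }
  if (sum_ns == 0) {
    return lp;  // the all-zero outcome has probability one
  }
  value_type_t<T_prior_size> alpha_sum = alpha_ref.sum();
  lp += lbeta(alpha_sum, sum_ns);

  if (!propto) {
    // terms that only depend on the (constant) counts
    lp += std::log(sum_ns);
    for (size_t i = 0; i < ns.size(); ++i) {
      if (ns[i] > 0) {
        lp -= std::log(ns[i]);
      }
    }
  }
  return lp;
}

template <typename T_prior_size>
return_type_t<T_prior_size> dirichlet_multinomial_lpmf(
    const std::vector<int>& ns, const T_prior_size& alpha) {
  return dirichlet_multinomial_lpmf<false>(ns, alpha);
}

}  // namespace math
}  // namespace stan
#endif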
Again, the logic of the actual computations is pretty simple and very similar to our user-defined Stan function. We also perform similar domain checks as in the RNG function. In addition, we check that
ns
and alpha
have compatible sizes. Then we define lp, the log-probability. The type of lp
has to be the same as the
base type of alpha
, and this can be achieved with return_type_t
. If you had multiple
parameters where some have base type double
and others Stan's autodiff type var
, then passing all parameter types to
return_type_t
would result in the appropriate return type var
.
The template parameter T_prior_size
is the type of the "prior size" (or intensity) parameter alpha
.
There are some restrictions on what alpha
is allowed to be, and the template parameter
require_eigen_col_vector_t<T_prior_size>* = nullptr
ensures that alpha
has the correct Eigen type.
I would like to understand this better, but that will have to wait.
Next let's look at the
propto
template parameter. This is used for dropping constant terms in the computation
of the log-PMF. In a Stan sampling statement, we don't have to compute any constant terms of the log-probability or
log-density, as constant terms do not matter in MCMC. As \(n\) is an integer vector and therefore always constant "data",
we don't have to compute log(sum_ns)
or log(ns[i])
. Not computing these logs can save time and energy.
In some cases, we do want to know the full log-PMF, including constant terms. For instance, in the generated quantities
block when we
want to use log-probabilities for model comparison (WAIC, LOO_IC, etc.).
It is also possible that alpha is data (meaning that the base type is double). In that case, there is nothing to compute if we're only interested in the log PMF up to constants. Hence, we return 0.0.
To test if "summands" depending on the "prior size" have to be included in the log PMF, if propto == true
,
we can use include_summand<propto, T_prior_size>::value
.
Next, we compute the terms of the log PMF that include alpha
, and in the same loop, we compute the sum of the ns
.
If sum_ns
is zero (\(N = 0\)), then lp
is still zero and we can simply return this value.
Then we compute \(\alpha_0 = \sum_{k=1}^K \alpha_k\) using Eigen's
sum
method. Note that the type
is value_type_t<T_prior_size>
, which is the base type of alpha
.
The difference between return_type_t
and value_type_t
is that the former
takes multiple arguments and returns the most "general" base type that fits all of them. The arguments
of return_type_t
don't have to be containers.
If
propto
is false, we still have to add the terms that only depend on ns
.
To give the user access to the functions we just defined, we have to add
#include
directives in the file prob.hpp
in folder stan/math/prim
.
This file contains a list of include directives for all distributions in alphabetical order. So we just add our new files in the right place.
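For the Dirichlet-Multinomial this amounts to adding lines like the following (keeping the alphabetical order; surrounding lines omitted):
#include <stan/math/prim/prob/dirichlet_multinomial_log.hpp>
#include <stan/math/prim/prob/dirichlet_multinomial_lpmf.hpp>
#include <stan/math/prim/prob/dirichlet_multinomial_rng.hpp>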
Let's now write a C++ program that imports the Stan library to test our new function. Instructions on how to use the Stan math library in your project can be found here. In this example, I will simply sample a random vector from the Dirichlet-Multinomial distribution with fixed intensity. I then compute the log-probability of that sample. I put some effort in generating "real" random numbers using
std::random_device
,
and I wrote a function to print std::vector
. It helps to read a bit about the Eigen library,
which has some methods for formatting matrices and vectors.
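Here is a condensed sketch of that test program (hello_dirmult.cpp); the helper function and formatting details differ a bit from my actual script:
#include <stan/math.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <iostream>
#include <random>
#include <vector>

// print a std::vector<int> as (a, b, c)
void print_vector(const std::vector<int>& v) {
  std::cout << "(";
  for (size_t i = 0; i < v.size(); ++i) {
    std::cout << v[i] << (i + 1 < v.size() ? ", " : "");
  }
  std::cout << ")";
}

int main() {
  // seed a boost RNG with "real" random numbers from std::random_device
  std::random_device rd;
  boost::random::mt19937 rng(rd());

  Eigen::VectorXd alpha(3);
  alpha << 10.1, 3.5, 7.2;
  int N = 100;

  // Eigen's IOFormat lets us print alpha as (10.1, 3.5, 7.2)
  Eigen::IOFormat fmt(Eigen::StreamPrecision, Eigen::DontAlignCols,
                      ", ", ", ", "", "", "(", ")");

  std::vector<int> x = stan::math::dirichlet_multinomial_rng(alpha, N, rng);
  std::cout << "random sample from dirichlet_multinomial(" << N << ", "
            << alpha.transpose().format(fmt) << ") = ";
  print_vector(x);
  std::cout << std::endl;

  double lp = stan::math::dirichlet_multinomial_lpmf(x, alpha);
  std::cout << "log dirichlet_multinomial(";
  print_vector(x);
  std::cout << " | " << alpha.transpose().format(fmt) << ") = " << lp << std::endl;

  return 0;
}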
This program prints something like:
random sample from dirichlet_multinomial(100, (10.1, 3.5, 7.2)) = (30, 8, 62)
log dirichlet_multinomial((30, 8, 62) | (10.1, 3.5, 7.2)) = -8.60311
To compile a program that makes use of the Stan math library, Stan provides a useful makefile called "standalone". Compilation of the above code is done as follows:
make -f path_to_stan_math/make/standalone hello_dirmult
Building the Stan math library can be done in a similar way
make -j4 -f path_to_stan_math/make/standalone math-libs
The -j4
flag determines the number of CPU cores (or threads) used for the task. I have an old Core i5, so I'll use 4.
If we look at the makefile in the Stan math directory, we see that there is a target called
doxygen
.
This is for building the documentation with doxygen using docstrings in the code.
In our case, we have used some LaTeX equations (using delimiters \f$ ... \f$
and \f[ ... \f]
),
which are rendered by doxygen. Let's check that the math is rendered correctly, and that we didn't make any TeX errors.
We can build the documentation with make doxygen
. We then find the HTML pages in doc/api/html
and view them in a browser. Here's a screenshot
And here's one for the RNG documentation
Building the docs revealed several mistakes (and I left at least one in the screenshots above), so doing this was definitely helpful.
Step 3: Writing unit tests
This step may sound a bit boring, but I found out that it is quite important. First of all, the Stan team will not accept your code without any unit tests. Second, testing your code helps: I managed to catch two errors in my (very small) piece of code, both thanks to writing tests. Stan makes use of the GoogleTest framework. There are lots of examples of tests for distributions in the folder math/test/unit/math/prim/prob/
.
I took multinomial_test.cpp
as a template, which I modified.
One of the tests checks that an exception is thrown if one of the intensity parameters is infinite. For the multinomial distribution, the function checks that the parameters form a simplex. Hence, it also checks that the values are finite, as they must lie in the interval \((0,1)\).
In our case, the intensities only have to be positive, and so initially I changed check_simplex
into check_positive
.
However, infinity is also positive and therefore this check does not raise an error if one of the parameters is infinite.
Therefore I changed the check to check_positive_finite
(provided in math/stan/math/prim/err.hpp
) and this time my test passed.
The other error was produced by a case in which all counts are zero. With our definition of the PMF, this would lead to \(0 / 0\), which gives a NaN.
However, if the number of trials (or population size) is zero, then it makes sense to define the probability of the event \((0,0,\dots,0)'\) as \(1\),
since this is the only possible event. I fixed this by checking for this condition in dirichlet_multinomial_lpmf
.
I will not show you all the unit tests I wrote/modified but just to give an idea, this is the first one where I compare the value of the Dirichlet-Multinomial PMF with a known value (calculated with scipy.stats
in Python)
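The test looks roughly like this (a sketch that is close to, but not necessarily identical to, the one I submitted):
#include <stan/math/prim.hpp>
#include <gtest/gtest.h>
#include <vector>

TEST(ProbDistributionsDirichletMultinomial, DirichletMultinomial) {
  using stan::math::dirichlet_multinomial_lpmf;
  Eigen::Matrix<double, Eigen::Dynamic, 1> theta(3, 1);
  theta << 2.0, 3.0, 5.0;
  std::vector<int> ns = {1, 2, 3};
  // reference value computed with scipy.stats in Python
  EXPECT_FLOAT_EQ(-2.477938, dirichlet_multinomial_lpmf(ns, theta));
}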
What's happening here is that
TEST
is a macro defined in gtest.h
. The identifiers ProbDistributionsDirichletMultinomial
and DirichletMultinomial
are used to name the test function. The test function's body is defined in curly brackets.
We then create a real vector \(\theta = (2.0, 3.0, 5.0)'\) and an integer vector \(n = (1,2,3)'\) and test if
\[
\log(p(n | \theta)) = -2.477938
\]
where \(p\) is the PMF of the Dirichlet-Multinomial distribution.
You can run all or a subset of tests with the
runTests.py
python script. For instance
python3 runTests.py test/unit/math/prim/prob/dirichlet_multinomial_test
runs only the new tests. Warning: running all tests takes a lot of time.
Step 4: Adding the distribution to the Stan documentation
Before we expose our new distribution to the user (i.e. make it a native function in the Stan programming language), we have to write some documentation for the Stan functions reference. We can clone the documentation with git clone git@github.com:stan-dev/docs.git
. The documentation is written in R markdown. This is the part that I added to
multivariate_discrete_distributions.Rmd
in the folder docs/src/functions-reference
This is largely based on the multinomial example. Note that all functions have an HTML-commented function signature. The README.md says this about these comments:
The Stan Functions Reference contains HTML comments which describe the function signature for all functions.
The script extract_function_sigs.py
is used to scrape these signatures into a plain text file.
This feature exists for Stan-specific syntax highlighting.
Syntax highlighters (see this link for a list)
have to know which variable names correspond to native functions so that they can be colored distinctly.
To build the documentation, use the python script build.py
as follows
python3 build.py 2 34
The numbers 2 and 34 correspond to the major and minor version numbers. The R markdown file contains strings such as `r since("2.34")`
which correspond to the version number when a function became available. It was not clear to me what version I should use in my new documentation.
At the time of writing, the version was 2.33, and I reasoned that this new distribution would become available in the next version: 2.34.
The build script should work if you have RStudio installed, which I don't (I have rstudio-server, in case you're wondering how I can survive). In that case, you'll have to install both pandoc and pandoc-citeproc with
sudo apt install
(on Ubuntu 22.04, that is). Make sure the R packages bookdown, dplyr, ggplot2, kableExtra are installed with install.packages()
. However, I also found that I needed to install the Ubuntu packages libfontconfig1-dev, libcurl4-openssl-dev, libxml2-dev with sudo apt install
. Here's a screenshot of the new page in the functions reference pdf.
Step 5: Adding the distribution to the stanc3
compiler
In terms of lines of code, this step is really simple. The stanc3 compiler is written in the OCaml language which might be
a bit unfamiliar (it certainly is for me). Also the OCaml ecosystem has a bit of a steep learning curve.
We start by cloning stanc3 using git clone git@github.com:stan-dev/stanc3.git
. The file that contains all native functions and distributions
can then be found in stanc3/src/middle/Stan_math_signatures.ml
. We just have to add lines for dirichlet_multinomial_rng
and dirichlet_multinomial
, and we can basically copy the code from the signatures of multinomial_rng
and multinomial
.
Instructions on how to install all dependencies and stanc can be found here. The stanc3 repository contains a script that does basically everything for you. However, if you want to install OCaml yourself, check this page. You will need the dune package. I started working on this a while ago, and when I picked up the work recently, I found that I had to update my OCaml version. This can be done with
opam switch create [version number]
(with the right version number).
Naturally, the stanc compiler has its own tests which includes a couple of Stan programs to test function signatures. The Stan program
stanc3/test/integration/good/function-signatures/distributions/rngs.stan
contains function calls for all RNGs.
We can add our new RNG to this file by adding the line
ns = dirichlet_multinomial_rng(alpha, 20);
The variables ns
and alpha
are already defined or declared.
I then added a folder stanc3/test/integration/good/function-signatures/distributions/multivariate/discrete/dirichlet_multinomial
and created a file dirichlet_multinomial_lpmf.stan
in this folder.
I basically copied the content of
the test file for the Multinomial distribution (located at stanc3/test/integration/good/function-signatures/distributions/multivariate/discrete/multinomial
)
and changed multinomial
to dirichlet_multinomial
. The fact that the multinomial distribution takes simplices and the Dirichlet Multinomial positive vectors does not seem to matter for these tests: it's just about function signatures. This is the test file that I added:
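It looks roughly like this (a sketch of the kind of signature-test program in that folder, not the verbatim file):
data {
  int K;
  array[K] int ns;
}
parameters {
  vector<lower=0>[K] alpha;
}
model {
  // exercise both the sampling statement and the explicit lpmf call
  ns ~ dirichlet_multinomial(alpha);
  target += dirichlet_multinomial_lpmf(ns | alpha);
}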
Finally, use
dune runtest
to check changes in the code, and then run dune promote
to accept these changes. Otherwise, the integration tests (see Step 7 below) will fail. More instructions on exposing new functions can be found here.
Step 6: Compiling a model with cmdstanpy
To test if everything works as expected, I will now write a Stan model that can be used to estimate intensity parameters of the
Dirichlet-Multinomial distribution, given some data that was sampled from this distribution with known parameters.
To make life easy (i.e. no messing around with cmdstan, json files etc.),
I want to do this in Python using the cmdstanpy
interface. Let's first write our Stan model.
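Here it is, roughly as I wrote it (the prior hyperparameters shown below are just an example):
data {
  int<lower=1> K;              // number of categories
  int<lower=0> N;              // number of observations
  array[N, K] int<lower=0> X;  // observed count vectors
  int<lower=0> M;              // population size for the pseudo-observations
}
parameters {
  vector<lower=0>[K] alpha;    // intensity (prior size) parameters
}
model {
  alpha ~ lognormal(0, 2);     // weakly-informative prior (hyperparameters are an example)
  for (i in 1:N) {
    X[i] ~ dirichlet_multinomial(alpha);
  }
}
generated quantities {
  array[N, K] int Y;           // pseudo-observations ("fake data")
  vector[N] log_lik;           // point-wise log-likelihood of the data
  for (i in 1:N) {
    Y[i] = dirichlet_multinomial_rng(alpha, M);
    log_lik[i] = dirichlet_multinomial_lpmf(X[i] | alpha);
  }
}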
This is pretty straightforward. We have \(K\) categories, \(N\) observations \(X\) and for sampling pseudo-observations, we also pass a population size parameter \(M\) to the model. In the parameters block, we define \(\alpha \in \mathbb{R}_+^K\), which is the intensity, and in the model block we give \(\alpha\) a weakly-informative log-normal prior. We have to write the "sampling statements" for \(X\) in a for loop, as the Dirichlet-Multinomial distribution does not allow vectorization at this point. In the generated quantities block, we sample "fake data" \(Y\) and compute the likelihood of the data \(X\), again in a for loop. This is not the most exciting model, but it's good enough for this test.
Important note: The "linking" solution presented in the following section works, but there is a better "official" solution, which I explain a little bit later.
So far, we have only used the Stan Math library in C++ directly. In order to get cmdstan to work with (1) our extended Stan Math library, and (2) the updated stanc compiler, we have to link all these components together. First, we have to point
stan
to the right math library. If you clone stan from github
(with git clone git@github.com:stan-dev/stan.git
) you will find a folder stan/lib/stan_math
. This folder is empty and should contain the contents of the Stan Math library. It is possible to replace the folder stan_math
with a link to the math
folder. First cd
to the stan/lib
folder and remove the empty stan_math
folder. Then create a link with
ln -s path/to/math stan_math
Then clone cmdstan
with git clone git@github.com:stan-dev/cmdstan.git
. The cmdstan folder contains an empty stan
folder. Replace this with a link to the modified stan folder.
This approach of creating links worked for me, but there is an easier way. It is possible to clone cmdstan from github "recursively", i.e. also clone the content of the
stan
folder, and the content of the lib/stan_math
folder at once.
Download and installation instructions for cmdstan can be found here.
As indicated in the "Installing from GitHub repo" section, we can clone the stan
and math
repositories with make stan-update
from the cmdstan
folder. Or, instead of just cloning the cmdstan repository, you can clone recursively with git clone https://github.com/stan-dev/cmdstan.git --recursive
. This cloning is quite slow, because you don't only get the latest version, but also the entire history of Stan development. To speed things up, you could do a "shallow" clone with the additional flags --depth 1 --shallow-submodules
. We can then build cmdstan with make build
. This will also download a pre-built stanc compiler.
This is not quite what we want, though, because we have to link to our own version of the math
library, and
our modified stanc
compiler.
To tell cmdstan that it should use our modified
stanc
version, we have to create a file called local
in the folder
cmdstan/make
. In fact, there is a file local.example
already present in this folder, which contains some other (commented-out) options.
We can copy this file with cp local.example local
and then open it in a text editor and add the line STANC3=path/to/stanc3
.
I have stanc3
and cmdstan
in the same folder, so for me this line is STANC3=../stanc3
It turns out that a similar trick can be used to use a custom
math
repository, and so we don't have to create symbolic links. In the cmdstan
makefile, you'll find these lines:
-include make/local  # user-defined variables
STAN ?= stan/
MATH ?= $(STAN)lib/stan_math/
The ?=
assignment in makefiles only works when the variable on the left-hand-side hasn't already been defined. So if we define MATH=path/to/math
in our make/local
file (which is included in the main makefile), then we override the MATH ?= stan/lib/stan_math/
assignment. We could do the same for the STAN
variable.
To compile and run the above Stan model with
cmdstanpy
, we have to make sure cmdstanpy
can find the right cmdstan installation.
This can simply be done using cmdstanpy.set_cmdstan_path("path/to/cmdstan")
where "path/to/cmdstan" is the location of the modified cmdstan installation.
The python script makes a plot of the ground truth parameter vector
alpha
(black dots), together with gray violins representing the posterior distribution.
Judging from this plot, everything seems to be working.
It also helps to look at the diagnostics (with print(sam.diagnose())
in python). They look reassuring:
Processing csv files: /tmp/tmp_vwn59j0/dirmultqq672ujg/dirmult-20231126104727_1.csv, ...
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
Step 7: Creating a pull request (PR) on GitHub
As Stan has multiple components, we have to submit multiple PRs. First, we have to submit our code to the Stan Math library. Then we have to provide new documentation for the functions reference, and then we have to submit our modifications to stanc3.
Stan Math PR
I will start with creating a PR for the math library. All PRs must correspond to an issue. So typically you first "request" a feature and then submit a PR to add it. In this case, there already exists an issue with a feature request for the Dirichlet-Multinomial distribution (from 2014), so we'll use that. There's a lot of information to digest before you make a PR. A place to start is the "contributing" page. A much more detailed (and essential) "Developer process overview" can be found here, although these instructions are sometimes specific to the
stan
repository (not math
). Then there is also this Introduction to Stan for New Developers.
It is also a good idea to have a look at the pull request template (which is math
specific).
On GitHub, fork the Stan math repository. Then clone the fork to your local machine. Make sure you're on the develop branch with
git branch
, and create a new branch with git checkout -b feature/issue-NN-short-description
(this creates and switches to the new branch) where NN is the issue number and "short-description" is... a short description of the feature.
In Steps 1-6, I had been working on just a clone of the math repository. So at this point I copied all new files and applied changes to the header file. The Stan team asks that you only commit well formatted code (more about this later), so be careful with what you commit to the feature branch.
When you're ready, add files with git add path/to/file
and commit changes with git commit -a -m "explain what changed"
. The -a
option makes sure that all tracked files are added before the commit. To push the new branch to the remote fork, use git push -u origin feature/issue-NN-short-description
There are some requirements not covered in Steps 1-6 above. To make sure your code passes Stan's automated tests, these tests should pass:
make test-headers
make test-math-dependencies
runTests.py test/unit
make doxygen
make cpplint
For me, cpplint failed as lines were longer than 80 characters, and some lines ended in whitespace.
In order to run cpplint, I had to install python2 (as Ubuntu 22.04 only has python3) with sudo apt install python2
.
On the web, there are some pages explaining how to make python2 the default python version. I think that's
a really bad idea. However, you can tell make to use python2 with
export PYTHON=python2
make cpplint
The make test-math-dependencies rule did not work for me, but this make target just points to ./runChecks.py, whose shebang line did not work on my system either. So I just ran the script directly with
python3 runChecks.py
This ran without producing any output, which means no errors were found.
My first PR failed very quickly. It turned out that passing cpplint was not sufficient to ensure proper formatting. To make sure that your code is formatted correctly, you have to use clang-format. Installing clang-format in Ubuntu requires you to first add the PPA for LLVM. Instructions are found here. On Ubuntu 22.04, you need the line "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy main". First get the signature with "wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -" (see the section "Install (stable branch)"). Then add the PPA with
sudo add-apt-repository
followed by the PPA string. It should also be possible to use the
GUI (software and updates). On the coding style page instructions are given to install clang-format-5.0
. For me,
this did not work, but installing clang-format
(without the 5.0) did.
In the hooks
folder, a script is provided to install a pre-commit git "hook". This makes sure that clang-format is run on all cpp and hpp files that were changed.
My second PR failed as well. This time, at the "verify changes" stage, I got the following message:
remote: Not Found
fatal: repository 'https://github.com/chvandorp/stan_math/math/' not found
error: pathspec 'forkedOrigin/feature/issue-54-dirichlet-multinomial'
did not match any file(s) known to git
script returned exit code 1
This makes some sense, because my fork is called stan_math and not stan_math/math: I had renamed the repository from math to stan_math to have a somewhat more descriptive name, which the testing pipeline did not expect. The solution was simple: just change the name of the fork on GitHub.
A name change automatically updates the PR. All you have to do is change the remote URL in your local .git/config
file.
In the meantime, the Stan team has fixed this "bug" in their testing pipeline: it should now be possible to choose a different name for your fork.
After attempt 2, I also received a request to add an additional unit test. This test had to be added to
test/unit/math/mix/prob/
using the expect_ad
function. I was able to use the Dirichlet distribution as an example. expect_ad
compares gradients computed with automatic differentiation against values obtained with the finite difference method. These tests also pass if an exception is thrown; in that case they just check that all AD types lead to the same exception. With these changes I tried submitting the PR again.
This time all checks passed.
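For reference, such a mix test looks roughly like this (a sketch modeled on the Dirichlet example, not my exact file):
#include <test/unit/math/test_ad.hpp>
#include <vector>

TEST(ProbDistributionsDirichletMultinomial, expect_ad) {
  std::vector<int> ns = {2, 1, 2};
  auto f = [&ns](const auto& alpha) {
    return stan::math::dirichlet_multinomial_lpmf(ns, alpha);
  };
  Eigen::VectorXd alpha(3);
  alpha << 3.2, 1.5, 12.1;
  // compares reverse- and forward-mode gradients to finite differences,
  // for all autodiff types
  stan::test::expect_ad(f, alpha);
}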
Revisions
My dirichlet_multinomial_lpmf
function is written in an ugly imperative style, while the Eigen and Stan math libraries provide methods to vectorize a lot of these operations. A Stan developer rewrote my function in a much nicer, more functional style, and also added analytic derivatives.
I definitely do not understand all details of this revised function, but let's go through it anyway. The first line that's different defines a variable
alpha_val
as follows
const auto& alpha_val = as_value_array_or_scalar(alpha_ref);
As I understand it, the point of this is that if we want to compute gradients ourselves, then we want to temporarily "turn off" automatic differentiation. Hence, we have to use the "values" of Stan autodiff types instead of the full types in our computations.
This is what as_value_array_or_scalar
does.
Next, we see that Stan math provides an overloaded function sum
that takes sums of a lot of things. So there's no need for for loops or std::accumulate
.
The integer valued ns
are then cast to double
values using Eigen's cast<double>
method (I'm not sure if the template disambiguator is required here).
Next, we are going to compute the log-probability
lp
.
First, we add \(\log(N) - \sum_{k : x_k > 0} \log(x_k)\) to lp
, but only if we are interested in constants.
Notice that we can do elementwise comparison ns_array > 0
on Eigen arrays, and use the select
method to compute an elementwise ternary operation.
One important thing to realize here is that the computations are "lazy". This means that although the expression log(ns_array) occurs in the C++ code, the log of the elements of ns_array that are zero is never evaluated.
Similarly, we add \(\log(B(\alpha_0, N)) - \sum_{k : x_k > 0} \log(B(\alpha_k, x_k))\) to lp
.
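To see this kind of laziness in action outside of Stan, here is a tiny standalone Eigen example (not taken from the Stan code base):
#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::ArrayXd ns_array(4);
  ns_array << 3.0, 0.0, 5.0, 0.0;
  // select() only evaluates ns_array.log() where the condition holds,
  // so log(0) is never computed for the zero entries
  Eigen::ArrayXd terms = (ns_array > 0).select(ns_array.log(), 0.0);
  std::cout << terms.transpose() << std::endl;  // prints: 1.09861 0 1.60944 0
  return 0;
}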
Now we will compute the gradient. For this, we use an object of the class
partials_propagator
, using the function make_partials_propagator
.
This function takes a variable number of arguments, one for each of the parameters or random variables that might require a gradient. In this case, this only holds for
alpha
, because ns
is always constant.
To put the "manually calculated" gradient in the right "slot", we use the partials<0>
function, where the template argument 0
is used because we only have a single non-constant parameter alpha
. The gradient of the distribution is given by
\[
\frac{\partial\log(p(x | \alpha))}{\partial\alpha_i} =
\frac{\partial\log(B(\alpha_0, N))}{\partial \alpha_i} - \sum_{k : x_k > 0} \frac{\partial \log(B(\alpha_k, x_k))}{\partial \alpha_i}
\]
Then we write \( \log(B(a, b)) = \log(\Gamma(a)) + \log(\Gamma(b)) - \log(\Gamma(a+b)) \) and consider two cases, \(x_i = 0\) (case I) and \(x_i > 0\) (case II).
In case I, we get
\[
\frac{\partial\log(p(x | \alpha))}{\partial\alpha_i} = \psi(\alpha_0) - \psi(\alpha_0 + N)
\]
where \(\psi(a) = \Gamma'(a) / \Gamma(a) \) denotes the digamma function.
In case II, we get
\[
\frac{\partial\log(p(x | \alpha))}{\partial\alpha_i} = \psi(\alpha_0) - \psi(\alpha_0 + N) - \psi(\alpha_i) + \psi(\alpha_i + x_i)
\]
Notice that we can use the latter expression for both cases, because \(\psi(\alpha_i) - \psi(\alpha_i + x_i) = 0\) when \(x_i = 0\). However,
evaluating the digamma function is not free. Therefore we use the select
method.
The build
method of the partials_propagator
class then combines gradients and values in the right way and returns the appropriate value.