Tuesday 6 February 2018

Computing q-values with C++

When looking for associations between features \(i = 1,\dots, m\) and some trait, it is often necessary to apply some sort of multiple-testing correction. A very conservative method is the Bonferroni correction, which controls the family-wise error rate (FWER), but at the cost of many false negatives. This is not desirable when one wants to discover features or associations, and therefore other methods have been developed. One particularly intuitive method is based on the false discovery rate (FDR) and uses so-called \(q\)-values, which are (under certain conditions) elegantly analogous to \(p\)-values.

False discovery rate

Let \(S\) be the number of features called significant, and \(F\) the number of false positives among the significant features (i.e. false discoveries). In an article by John D. Storey, the (positive) false discovery rate is defined as \[ {\rm pFDR} := \mathbb{E}\left[\left.\frac{F}{S}\right| S > 0 \right]\,. \] Hence, the pFDR is the expected fraction of false positives among the features that are called significant. The condition \(S > 0\) ensures that it is well defined.

In the case of hypothesis testing, one typically has a test statistic \(T\), and one wants to test whether the null hypothesis is true (\(H = 0\)) or the alternative hypothesis is true (\(H = 1\)). The statistical model specifies the distribution of \(T | H = 0\), and the null hypothesis is rejected when the realization of \(T\) falls into a pre-defined significance region \(\Gamma\).
When testing multiple features, we typically have a sequence \((T_i, H_i)_{i=1}^m\), here assumed to be independent and identically distributed. The \({\rm pFDR}\) then depends on \(\Gamma\): \[ {\rm pFDR}(\Gamma) = \mathbb{E}\left[\left.\frac{F(\Gamma)}{S(\Gamma)}\right| S(\Gamma) > 0 \right] \,, \] where \(F(\Gamma) := \#\{i : T_i \in \Gamma \wedge H_i = 0 \} \) and \(S(\Gamma) := \#\{i : T_i \in \Gamma\}\). Storey derives that under certain conditions, we can write \[ {\rm pFDR}(\Gamma) = \mathbb{P}[H = 0 | T \in \Gamma] = \frac{\mathbb{E}[F(\Gamma)]}{\mathbb{E}[S(\Gamma)]}\,. \]
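To sketch why this identity holds: with \(m\) independent and identically distributed tests, \(\mathbb{E}[F(\Gamma)] = m\, \mathbb{P}[H = 0]\, \mathbb{P}[T \in \Gamma | H = 0]\) and \(\mathbb{E}[S(\Gamma)] = m\, \mathbb{P}[T \in \Gamma]\), so that by Bayes' rule \[ \frac{\mathbb{E}[F(\Gamma)]}{\mathbb{E}[S(\Gamma)]} = \frac{\mathbb{P}[T \in \Gamma | H = 0]\, \mathbb{P}[H = 0]}{\mathbb{P}[T \in \Gamma]} = \mathbb{P}[H = 0 | T \in \Gamma]\,. \]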

The q-value

Let \(\{\Gamma_{\alpha}\}_{\alpha \in [0,1]}\) be a nested family of significance regions. That is, \(\Gamma_{\alpha} \subseteq \Gamma_{\alpha'}\) whenever \(\alpha \leq \alpha'\), and \(\mathbb{P}[T \in \Gamma_{\alpha} | H=0] = \alpha\). For instance, if \(T | H = 0 \sim \mathcal{N}(0,1)\), then we could choose \(\Gamma_{\alpha} = [z_{\alpha}, \infty)\), where \(z_{\alpha} = \Phi^{-1}(1-\alpha)\), with \(\Phi\) the CDF of \(\mathcal{N}(0,1)\).
The \(q\)-value of a realization \(t\) of \(T\) is then defined as \[ q(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} {\rm pFDR}(\Gamma_{\alpha})\,. \] We can now make the above-mentioned analogy between \(p\)-values and \(q\)-values precise. The \(p\)-value is defined as \[ p(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} \mathbb{P}[T \in \Gamma_{\alpha} | H = 0]\,, \] while under the right conditions, the \(q\)-value can be written as \[ q(t) = \inf_{\{\Gamma_{\alpha} : t \in \Gamma_{\alpha}\}} \mathbb{P}[H = 0 | T \in \Gamma_{\alpha}]\,. \] The \(q\)-value thus reverses the roles of the event \(T \in \Gamma_{\alpha}\) and the null hypothesis \(H = 0\).
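For instance, with the Gaussian family above, the regions containing \(t\) are exactly those with \(z_{\alpha} \leq t\), and since \(\alpha = 1 - \Phi(z_{\alpha})\) is decreasing in \(z_{\alpha}\), the infimum is attained at \(z_{\alpha} = t\). This recovers the familiar one-sided \(p\)-value \[ p(t) = \mathbb{P}[T \geq t | H = 0] = 1 - \Phi(t)\,. \]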

Computing q-values

In order to compute \(q\)-values, given a sequence of \(p\)-values, we follow the steps given in a paper by Storey and Tibshirani. In this scenario, the \(p\)-value plays the role of the realization \(t\) of the statistic \(T\). Under the null hypothesis, these \(p\)-values are uniformly distributed on \([0,1]\). As a family of significance regions, we simply take \(\Gamma_{\alpha} = [0,\alpha]\), and we write, for instance, \(S(\alpha) := S(\Gamma_{\alpha})\).
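Note that this family indeed satisfies the defining properties of a nested family of significance regions: \([0,\alpha] \subseteq [0,\alpha']\) whenever \(\alpha \leq \alpha'\), and under the null hypothesis \[ \mathbb{P}[p \in \Gamma_{\alpha} | H = 0] = \mathbb{P}[p \leq \alpha | H = 0] = \alpha\,. \]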

First, we have to estimate \(\mathbb{E}[S(\alpha)]\), for which we use \(\#\{i : p_i \leq \alpha\}\), and we estimate \(\mathbb{E}[F(\alpha)]\) with \(m \cdot \hat{\pi}_0 \cdot \alpha\), where \(\hat{\pi}_0\) is an estimate for \(\pi_0 = \mathbb{P}[H=0]\).
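Combining these two estimates yields a plug-in estimate of the pFDR at significance level \(\alpha\): \[ \widehat{\rm pFDR}(\alpha) = \frac{\hat{\pi}_0\, m\, \alpha}{\#\{i : p_i \leq \alpha\}}\,. \] The \(q\)-value of feature \(i\) is then obtained by minimizing this quantity over all \(\alpha \geq p_i\), which is exactly what the algorithm below computes.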
The most difficult part of computing \(q\)-values is estimating \(\pi_0\). An often-used estimate is the average \(p\)-value, but we can do a little better by making use of the empirical CDF of the \(p\)-values. The figure below shows a "typical" histogram of \(p\)-values, where the \(p\)-values sampled from the alternative distribution are given by the gray bars. The marginal distribution of \(p\)-values becomes relatively flat towards the right, where most \(p\)-values should come from the null distribution, which is \({\rm Uniform}(0,1)\). The right panel of the figure shows a "rotated" empirical CDF of the \(p\)-values, i.e. \[ x \mapsto R(x) = 1-{\rm CDF}(1-x)\,, \] and the fact that large \(p\)-values should be uniformly distributed is reflected in the fact that \(R(x)\) is approximately a straight line for small \(x\). The slope of this line is an estimate of \(\pi_0\). In the C++ code below, I use GSL to fit a line through \(R\) (indicated by the red line in the figure), using a weight function \(x \mapsto (1-x)^2\) to give more weight to small values of \(x\), thereby mostly ignoring the non-straight part of \(R\).

In this example, the marginal \(p\)-values are sampled from the mixture distribution \(\pi_0\, {\rm Uniform}(0,1) + (1-\pi_0)\, {\rm Beta}(1,10)\), where \(\pi_0 = 3/4\), so that the \(p\)-values under the alternative concentrate near zero.
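A sketch of this estimation step could look as follows, using GSL's gsl_fit_wmul routine, which computes a weighted least-squares fit of a line through the origin; the \(p\)-values are assumed to be passed as a std::vector<double>:

// estimate pi0 by fitting a straight line through the origin to the
// rotated empirical CDF R(x) = 1 - CDF(1-x), using weighted least
// squares with weight function (1-x)^2
#include <vector>
#include <algorithm>
#include <gsl/gsl_fit.h>

double estimate_pi0(std::vector<double> pvalues) {
    if ( pvalues.empty() ) return 1.0;
    std::sort(pvalues.begin(), pvalues.end());
    const size_t m = pvalues.size();
    std::vector<double> xs(m), ys(m), ws(m);
    for ( size_t k = 0; k < m; ++k ) {
        xs[k] = 1.0 - pvalues[k];               // abscissa of the rotated ECDF
        ys[k] = 1.0 - double(k+1) / m;          // R(xs[k])
        ws[k] = (1.0 - xs[k]) * (1.0 - xs[k]);  // weight (1-x)^2
    }
    double slope, cov11, sumsq;
    // fit the model y = c1 * x, i.e. a line through the origin
    gsl_fit_wmul(xs.data(), 1, ws.data(), 1, ys.data(), 1, m,
                 &slope, &cov11, &sumsq);
    // the slope of the fitted line estimates pi0; clamp it to [0,1]
    return std::max(0.0, std::min(1.0, slope));
}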


Now we sort the \(p\)-values such that \(p_1 \leq p_2 \leq \dots \leq p_m\), so that \(\#\{j : p_j \leq p_i\} = i\), and first determine the \(q\)-value corresponding to \(p_m\): \[ q_m = \hat{\pi}_0 \cdot p_m\,. \] The \(q\)-value \(q_i\) corresponding to \(p\)-value \(p_i\) (for \(i < m\)) is then computed as \[ q_i = \min(\hat{\pi}_0\, p_i\, m/i,\; q_{i+1})\,, \] where taking the minimum with \(q_{i+1}\) implements the infimum over all significance regions containing \(p_i\). Recently, I had to implement this algorithm in C++, as a function in a header file that accepts a "keyed" list of \(p\)-values, where the key is used to identify the feature, and returns a keyed list of \(q\)-values.
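A minimal sketch of such a header (qvalues.hpp) could look as follows; it reuses the estimate_pi0 function from above and represents the keyed list as a vector of (key, p-value) pairs, which is an assumption about the original interface:

// qvalues.hpp -- compute q-values from a keyed list of p-values
#pragma once
#include <vector>
#include <utility>
#include <algorithm>
#include <gsl/gsl_fit.h>

// double estimate_pi0(std::vector<double> pvalues) { ... as above ... }

template <typename Key>
std::vector<std::pair<Key, double>>
compute_qvalues(std::vector<std::pair<Key, double>> pvalues) {
    typedef std::pair<Key, double> KeyedValue;
    const size_t m = pvalues.size();
    if ( m == 0 ) return pvalues;
    // sort the pairs by ascending p-value, so that pvalues[i] has rank i+1
    std::sort(pvalues.begin(), pvalues.end(),
        [](const KeyedValue & a, const KeyedValue & b) {
            return a.second < b.second;
        });
    std::vector<double> ps(m);
    for ( size_t i = 0; i < m; ++i ) ps[i] = pvalues[i].second;
    const double pi0 = estimate_pi0(ps);
    // start with q_m = pi0 * p_m, then work backwards using
    // q_i = min(pi0 * p_i * m / i, q_{i+1})
    std::vector<KeyedValue> qvalues(m);
    qvalues[m-1] = KeyedValue(pvalues[m-1].first, pi0 * ps[m-1]);
    for ( size_t i = m-1; i-- > 0; ) {
        double q = std::min(pi0 * ps[i] * m / (i+1), qvalues[i+1].second);
        qvalues[i] = KeyedValue(pvalues[i].first, q);
    }
    return qvalues;
}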



The following code gives an example of how to use qvalues.hpp, and can be used for testing:
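Given the sample output below, such a test program might look like this sketch; the number of features \(m = 1000\) and the null fraction \(\pi_0 = 3/4\) match the example above, while the \(q\)-value cutoff of 0.2 used for calling discoveries is an assumption:

// qvalues_test.cpp -- example usage of qvalues.hpp
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <random>
#include <string>
#include <utility>
#include <vector>
#include "qvalues.hpp"

int main(int argc, char ** argv) {
    // the RNG seed is passed on the command line, as in: ./qvaltest 1728
    unsigned long seed = ( argc > 1 ? std::stoul(argv[1]) : 0 );
    std::mt19937 rng(seed);

    const size_t m = 1000;      // number of features
    const double pi0 = 0.75;    // probability that a feature is null (H = 0)
    const double cutoff = 0.2;  // call a discovery when q <= cutoff (assumed)

    std::bernoulli_distribution null_dist(pi0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    // keyed p-values; the key records the true hypothesis H, so that
    // we can count true and false discoveries afterwards
    std::vector<std::pair<bool, double>> pvalues(m);
    for ( auto & kp : pvalues ) {
        bool h0 = null_dist(rng);
        double u = unif(rng);
        // under H = 0 the p-value is Uniform(0,1); under H = 1 it is
        // Beta(1,10), sampled by inverse transform as 1 - u^(1/10)
        kp = std::make_pair(h0, h0 ? u : 1.0 - std::pow(u, 0.1));
    }

    std::vector<std::pair<bool, double>> qvalues = compute_qvalues(pvalues);

    int td = 0, fd = 0, tn = 0, fn = 0; // true/false discoveries/negatives
    for ( size_t i = 0; i < m; ++i ) {
        bool h0 = qvalues[i].first;
        if ( qvalues[i].second <= cutoff ) ( h0 ? fd : td )++;
        else ( h0 ? tn : fn )++;
    }
    std::cout << "                true false" << std::endl
              << "  discoveries: " << std::setw(5) << td << " " << std::setw(5) << fd << std::endl
              << "    negatives: " << std::setw(5) << tn << " " << std::setw(5) << fn << std::endl
              << "realized FDR: " << double(fd) / std::max(td + fd, 1) << std::endl;
}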


After compiling and running the program, the result should be something like:
$ g++ -O3 -std=c++11 qvalues_test.cpp -lgsl -lgslcblas -o qvaltest
$ ./qvaltest 1728
                true false
  discoveries:   143    32
    negatives:   712   113
realized FDR: 0.182857