
Thursday, 1 November 2018

The simplest example on c++11 <random>

For some reason, the internet lacks a beginner-level tutorial on how to use the C++11 <random> library when your code is spread over multiple files.
This is not a tutorial, this is just what I managed to put together...

I want
  • one global random engine (e.g. the Mersenne Twister)
  • one real-valued uniform distribution on the interval [0,1); let's call this function RANDOM()
  • to be able to call RANDOM() from anywhere in my code.

There are three files: random.h, random.cpp and main.cpp (any additional .cpp file that includes random.h can use the function RANDOM() ).
The content of the files is as follows:
 
random.h
// random.h
#ifndef RND_HH_
#define RND_HH_

#include <random> //--- FOR THIS YOU NEED c++11, enable with -std=c++11 flag

// Declare engine - single instance for the whole code
//extern std::mt19937 my_rng;
extern std::mt19937_64 my_rng;

//Declare distributions:
extern std::uniform_real_distribution<double> my_unif_real_dist;
//extern std::uniform_int_distribution<double> my_unif_int_dist;

int Seed(int seed);
double RANDOM();

#endif

// end of random.h 
random.cpp
//random.cpp
#include <stdio.h>
#include <iostream>
#include <chrono>
#include "random.h"

//std::mt19937 my_rng {}; 
std::mt19937_64 my_rng {}; // Defines an engine
std::uniform_real_distribution<double> my_unif_real_dist(0., 1.); //Define distribution
// Function to seed the random number generator from main file
// useful if you want the seed from a parameter file
// a negative value for seed gets you a random seed
// outputs the seed itself
int Seed(int seed)
{
  if (seed < 0) {
    long unsigned int rseed = static_cast<long unsigned int>(std::chrono::high_resolution_clock::now().time_since_epoch().count());
    std::cerr << "Randomizing random generator, seed is "<<rseed<<std::endl;
    my_rng.seed(rseed);
    return rseed;
  } else {
    std::cerr << "User-provided seed is "<<seed<<std::endl;
    my_rng.seed(seed);
    return seed;
  }
}
// This is the function to call if you want a random number in the interval [0,1)
double RANDOM(void)
{
  return my_unif_real_dist(my_rng);
}
// end of random.cpp
And finally for the main.cpp file
//main.cpp
#include <stdio.h>
#include <iostream>
#include "random.h"

int main()
{
  int max = 10;
  int my_seed = 235;
  
  int my_new_Seed = Seed(my_seed);
  
  for(int i=0; i<max;++i){
    double one_random_number = RANDOM();
    std::cerr << one_random_number << std::endl;
  }
}// end of main.cpp
That's it! This is really all you need.
Compile it with:
g++ -std=c++11 random.cpp main.cpp -o my_pretty_random_numbers
and happy random number generation.
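To illustrate the claim that any additional .cpp file can use RANDOM(), here is a minimal sketch of an extra translation unit. The file name walker.cpp and the function random_walk are made up for this example; just add walker.cpp to the g++ command above.

// walker.cpp (hypothetical extra file)
#include "random.h"

// end point of a simple random walk with a given number of steps
double random_walk(int steps)
{
  double position = 0.0;
  for (int i = 0; i < steps; ++i) {
    // RANDOM() is uniform on [0,1): step left or right with equal probability
    position += (RANDOM() < 0.5) ? -1.0 : 1.0;
  }
  return position;
}
// end of walker.cpp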

By the way, this seems to work fine on my machine (running Ubuntu 18).
Can this be improved in simplicity and/or performance? Are there bugs?
Please let me know!

Sunday, 3 June 2018

Easy (Bayesian) multidimensional scaling with Stan

Multidimensional scaling (MDS) is a data visualization technique in which the dimension of the data is reduced in a non-linear way. The data is represented as an $N \times N$ distance matrix $(d_{ij})_{ij}$, and $N$ points $x_i$ in a $D$-dimensional space (typically $D = 2$) are chosen such that the Euclidean distances $\|x_i - x_j\|$ resemble the input distances $d_{ij}$ "as well as possible".

In metric MDS, an objective function $E(x) = \sum_{1 \le i < j \le N} \left(d_{ij} - \|x_i - x_j\|\right)^2$ is defined that needs to be minimized. For different flavors of MDS, this objective function is defined differently. In order to minimize the objective function, one can use, for example, the conjugate gradient descent method. This method requires that one calculates the gradient $\nabla E$ of the objective function. Of course, this is not so difficult in the case of metric MDS, but more difficult objective functions might require more effort. Enter Stan.
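(For reference, the gradient that such a method needs is easy to write down for the metric objective above:
$$\nabla_{x_k} E = -2 \sum_{j \neq k} \left(d_{kj} - \|x_k - x_j\|\right) \frac{x_k - x_j}{\|x_k - x_j\|},$$
with one term for every other point $j$.)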

Stan uses automatic differentiation for Hamiltonian Monte Carlo, but Stan can also be used for maximum likelihood. Hence, if we can formulate the MDS problem in terms of a likelihood function, we can let Stan do all the work. The parameters of the model are the $x_i$, the data is given by the distances $d_{ij}$. If we assume that given the parameters, the data is distributed as $d_{ij} \sim \mathcal{N}(\|x_i - x_j\|, \sigma^2)$, then maximizing the (log) likelihood is equivalent to minimizing the function $E$. The parameter $\sigma^2$ is merely a nuisance parameter that needs to be estimated as well.
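To see why, write down the log-likelihood of this model (a short check using the notation above):
$$\log L(x, \sigma) = -\sum_{1 \le i < j \le N} \frac{\left(d_{ij} - \|x_i - x_j\|\right)^2}{2\sigma^2} + \text{const} = -\frac{E(x)}{2\sigma^2} + \text{const},$$
where the constant collects the normalization terms that do not depend on $x$. For any fixed $\sigma$, maximizing $\log L$ over the $x_i$ is therefore the same as minimizing $E$.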

An implementation of MDS in the Stan programming language

Implementing MDS in Stan is fairly straightforward, but there are a few snags that we should be aware of. First, if $x$ solves the MDS problem, then any Euclidean transformation of $x$ is also a solution. Hence, the model as stated above has too many parameters. We solve this by fixing the first point at the origin, restricting the next point to a 1-dimensional half space, the third point to a 2-dimensional half space, et cetera. The last $N - D - 1$ points are unrestricted. In Stan, we can accomplish this by using a cholesky_factor_cov matrix: a positive-definite lower-triangular matrix. We then use the transformed parameters block to concatenate the points together into a single matrix.

Secondly, the data that we use in the example below is highly censored. Many of the distances are missing, and some are right censored. In such a case MDS can be used to infer the missing distances, and not merely visualize the data. The data that is passed to Stan, therefore, is a list of edges, a list of distances, and a list of codes that determine the type of censoring.
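Concretely, writing $\delta_e$ for the model distance between the endpoints of edge $e$ ($\delta_e$ is just shorthand here; it is called dist in the code), the log-likelihood contribution of an observed distance $d_e$ in the model block below is
$$\log P[d_e \mid \delta_e, \sigma] = \begin{cases} \log \phi\!\left(\tfrac{d_e - \delta_e}{\sigma}\right) - \log\sigma & \text{uncensored (sampling statement)}, \\[4pt] \log \Phi\!\left(\tfrac{d_e - \delta_e}{\sigma}\right) & \text{left-censored (normal\_lcdf)}, \\[4pt] \log\!\left(1 - \Phi\!\left(\tfrac{d_e - \delta_e}{\sigma}\right)\right) & \text{right-censored (normal\_lccdf)}, \end{cases}$$
with $\phi$ and $\Phi$ the standard normal density and CDF; missing distances simply contribute nothing to the target.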

Thirdly, as the title of this post suggests, we will use Stan to do some sort of Bayesian MDS. In this case, we will sample a collection of "maps" $x$ from a posterior distribution, which gives information about the location of each point, but also about the uncertainty of this location. In this case, the fact that we restrict the first $D+1$ points comes back to bite us, as the uncertainty of these points will be different from that of the unrestricted points. Furthermore, it might be hard to compare the individual maps to one another, and for instance compute sensible mean locations of the points, as some maps may be "twisted" more than others. Therefore, we use the generated quantities block to center and rotate (cf. PCA) the sampled maps.
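In formulas, the generated quantities block below computes (with $\bar{x}$ the row-wise mean of $X$, i.e. the mean position, and $U$ the matrix whose columns are the eigenvectors of $X^c (X^c)^T$):
$$X^c = X - \bar{x}\,\mathbf{1}_N^T, \qquad X^{cr} = U^T X^c,$$
so that each sampled map is translated to have its centroid at the origin and rotated onto its principal axes, which makes the maps comparable across MCMC samples.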

data {
    int<lower=1> N; // number of nodes
    int<lower=0, upper=(N*(N-1))/2> E; // number of edges
    int<lower=1, upper=N-1> D; // target dimension
    int<lower=1, upper=N> edges[E,2]; // a list of edges (i1, i2)
    vector<lower=0>[E] distances;
    int<lower=0, upper=3> censoring[E]; // censoring of distances
    /* 0: uncensored,
     * 1: left-censored,
     * 2: right-censored,
     * 3: missing
     */
}
parameters {
    /* the first point is fixed at the origin (0)
     * the next D points are stored in a lower triangular matrix (X1)
     * the remaining points are given by a N-D-1 x D matrix (X2)
     */
    cholesky_factor_cov[D] X1;
    matrix[D, N-D-1] X2;
    real<lower=0> sigma; // nuisance parameter
}
transformed parameters {
    // put all coordinates in a single matrix X
    matrix[D, N] X; // [0, X1', X2]
    X = append_col(rep_vector(0, D), append_col(X1', X2));
}
model {
    // try to keep sigma small
    sigma ~ exponential(1);
    // calculate the current distances (between the Xs)
    for ( e in 1:E ) {
        int i1; int i2;
        real dist;
        // compute the distance
        i1 = edges[e][1];
        i2 = edges[e][2];
        dist = distance(col(X, i1), col(X, i2));
        // likelihood
        if ( censoring[e] == 0 ) { // uncensored
            distances[e] ~ normal(dist, sigma);
        } else if ( censoring[e] == 1 ) { // left-censored
            target += normal_lcdf(distances[e] | dist, sigma);
        } else if ( censoring[e] == 2 ) { // right-censored
            target += normal_lccdf(distances[e] | dist, sigma);
        } else if ( censoring[e] == 3 ) { // missing
            // do nothing
        }
    }
}
generated quantities {
    vector[D] meanX; // the means of each MCMC sample
    matrix[D, N] Xc; // centered positions
    matrix[D, N] Xcr; // centered and rotated positions
    // center X
    for ( i in 1:D ) {
        meanX[i] = mean(row(X, i));
    }
    Xc = X - rep_matrix(meanX, N);
    // rotate Xc
    Xcr = eigenvectors_sym(tcrossprod(Xc))' * Xc;
}


Example: Antigenic cartography of the influenza virus

An interesting application of MDS is antigenic cartography of the influenza virus. Influenza virus circumvents human antibody responses by continually evolving its surface proteins, in particular, hemagglutinin (HA). This is known as antigenic drift. In order to decide whether flu vaccines need to be updated, the hemagglutination inhibition (HI) assay is used to determine if the induced antibody response is still effective against future strains. The titers measured in the HI assay can be used to define "distances" between antisera and antigens. Using MDS, the antisera and antigens can be drawn into a "map" that shows the antigenic drift of the virus. This was done by Smith et al. in 2004. Conveniently, the data used for their map is available online. This table gives HI titers $H_{ij}$ of antigen $i$ and antiserum $j$. A small titer corresponds to a large distance, defined as $d_{ij} = \log_2(H_{\max,j}) - \log_2(H_{ij})$, where $H_{\max,j} = \max_k H_{kj}$. As an example, I recreated their antigenic map using the Stan model above, and the Python script below.
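To make the distance definition concrete with some made-up titers: if the largest titer measured against antiserum $j$ is $H_{\max,j} = 1280$ and antigen $i$ has titer $H_{ij} = 160$, then
$$d_{ij} = \log_2(1280) - \log_2(160) = \log_2\!\left(\tfrac{1280}{160}\right) = \log_2(8) = 3,$$
i.e. each two-fold dilution step adds one unit of distance.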


The white squares denote the relative positions of the antisera in the "antigenic space", while the colored circles represent the antigens. The colors map to the years in which the influenza strains were isolated.

Bayesian multidimensional scaling

For antigenic cartography of IAV, Bayesian MDS has been introduced by Bedford et al., who used multiple HI assay results per antigen/antiserum pair to incorporate the uncertainty of these measurements in their antigenic map. Moreover, they were able to use genetic and temporal information about the antigens (i.e. the RNA sequences of HA and their isolation dates) to inform the position of the antigens and antisera on the map. We will not go this far in this post, but since we have already formulated the MDS algorithm in Stan, we might as well make a "Bayesian" antigenic map. This can give some insight into the uncertainty of the positions of the antigens and antisera. This is not unlike the confidence areas as drawn by Smith et al. (the ellipsoid shapes). The result is given by the following figure.


Again, squares indicate antisera and colored circles the antigens. All the individual MCMC samples are represented by the grey dots. The MCMC samples for each antigen or antiserum are used to draw a two-dimensional error bar (i.e. ellipse) around the mean location.
A Python script for parsing the HI titer data, compiling and running the Stan model, and drawing the maps is added below. For it to work, you will need to download the mds_model.stan file and make a csv file called baselinemap.csv containing the HI table.

import csv
import pystan
import pickle
import numpy as np
import hashlib
import matplotlib
import matplotlib.pyplot as plt

################ some auxiliary functions ###############

def CachedStanModel(model_code, model_name=None):
    """
    Function to cache compiled Stan models. See:
    https://pystan.readthedocs.io/en/latest/
    """
    code_hash = hashlib.md5(model_code.encode('ascii')).hexdigest()
    if model_name is None:
        cache_fn = 'cached-model-{}.pkl'.format(code_hash)
    else:
        cache_fn = 'cached-{}-{}.pkl'.format(model_name, code_hash)
    try:
        sm = pickle.load(open(cache_fn, 'rb'))
    except:
        sm = pystan.StanModel(model_code=model_code)
        with open(cache_fn, 'wb') as f:
            pickle.dump(sm, f)
    else:
        print("Using cached StanModel")
    return sm

def mkEllipse(Xt, scale=1):
    """
    Function to make an ellipse that covers a point cloud.
    The scale parameter can be used to rescale the ellipse.
    """
    ## for each point cloud, do a 'PCA'
    meanXt = np.mean(Xt, axis=0)
    ## center the cloud to do eigen decomposition
    Xtprime = np.array([X - np.mean(Xt, axis=0) for X in Xt])
    C = np.dot(Xtprime.transpose(), Xtprime) / Xt.shape[0]
    eivals, U = np.linalg.eigh(C)
    ## compute angle (in degrees)
    angle = np.arccos(U[0,0]) * np.sign(U[0,1]) * 360/(2*np.pi)
    height = scale * 2 * np.sqrt(eivals[1])
    width = scale * 2 * np.sqrt(eivals[0])
    ell = matplotlib.patches.Ellipse(xy=meanXt,
                                     width=width, height=height, angle=angle)
    return ell

def getYear(lab):
    """
    Single purpose function to get sampling year
    from antigen/antiserum label.
    """
    if lab[-3:]=="REC":
        yr_str = lab[-5:-3]
    else:
        yr_str = lab[-2:]
    yr = int(yr_str)
    ## a bodge for the Y2K bug
    return yr + 2000 if yr < 50 else yr + 1900

############### compile Stan model and HI data ###############

## open the text file containing the Stan model and compile it.
with open("mds_model.stan", 'r') as f:
    mds_model = f.read()
sm = CachedStanModel(model_code=mds_model)

## import influenza HI data
with open("baselinemap.csv", 'r') as f:
    ## NB: this depends on how you saved the HI data
    reader = csv.reader(f, delimiter=';')
    table = [row for row in reader]

## the header starts at the first field
## (for you, this could be the second field)
sr_labs = ["SR_" + elt for elt in table[0] if elt != '']
## the first column contains the antigen labels
ag_labs = ["AG_" + row[0] for row in table[1:]]
values = [row[1:] for row in table[1:]]

labs = sr_labs + ag_labs
nodes = list(range(1, len(labs)+1))
labDict = {n : l for n, l in zip(nodes, labs)}
nodeDict = {l : n for n, l in zip(nodes, labs)}

## uncensored titers
uTiterDict = {
    (nodeDict[l1], nodeDict[l2]) : np.log2(float(values[i2][i1]))
    for i1, l1 in enumerate(sr_labs)
    for i2, l2 in enumerate(ag_labs)
    if values[i2][i1] != '*' and values[i2][i1][0] != '<'
}
## left-censored titers (i.e. right-censored distances)
cTiterDict = {
    (nodeDict[l1], nodeDict[l2]) : np.log2(float(values[i2][i1][1:]))
    for i1, l1 in enumerate(sr_labs)
    for i2, l2 in enumerate(ag_labs)
    if values[i2][i1] != '*' and values[i2][i1][0] == '<'
}

edges = list(uTiterDict.keys()) + list(cTiterDict.keys())
edges.sort()
titerDict = {
    e : w for e, w in list(uTiterDict.items()) + list(cTiterDict.items())
}
## censoring: 0 means uncensored, 2 means right-censored
censorDict = {e : 0 for e in uTiterDict.keys()}
censorDict.update({e : 2 for e in cTiterDict.keys()})

## find the maximum titer for each antiserum
maxTiterDict = {
    l : np.max([logH for (i1, i2), logH in uTiterDict.items() if labDict[i1]==l])
    for l in sr_labs
}
## the distance d_{ij} is defined as log_2(H_{max,j}) - log_2(H_{i,j})
distanceDict = {
    e : maxTiterDict[labDict[e[0]]] - w for e, w in titerDict.items()
}

distances = [distanceDict[e] for e in edges]
censoring = [censorDict[e] for e in edges]

data = {
    'D' : 2,
    'E' : len(edges),
    'N' : len(nodes),
    'distances' : distances,
    'censoring' : censoring,
    'edges' : edges
}

############# use Stan to minimize the MDS error ############

fit_opt = sm.optimizing(data=data)
Xs_opt = fit_opt["Xcr"]
optXts = Xs_opt.transpose()

fig = plt.figure(figsize=(5,8))
ax1 = fig.add_subplot(111, aspect='equal')

optxs = [X[0] for X in optXts]
optys = [X[1] for X in optXts]
optxs_ag = [m for m, n in zip(optxs, nodes) if labDict[n][:2]=="AG"]
optys_ag = [m for m, n in zip(optys, nodes) if labDict[n][:2]=="AG"]
optxs_sr = [m for m, n in zip(optxs, nodes) if labDict[n][:2]=="SR"]
optys_sr = [m for m, n in zip(optys, nodes) if labDict[n][:2]=="SR"]
year_ag = [getYear(labDict[n]) for n in nodes if labDict[n][:2]=="AG"]

C = ax1.scatter(optxs_ag, optys_ag, s=10, c=year_ag, cmap='viridis',
                linewidth=1, edgecolor='k', zorder=2)
ax1.scatter(optxs_ag, optys_ag, s=5, c=year_ag, cmap='viridis',
            linewidth=0, zorder=3)
fig.colorbar(C, ax=ax1, shrink=0.3)
ax1.scatter(optxs_sr, optys_sr, s=10, c='w', marker='s',
            linewidth=1, edgecolor='k', zorder=4)

ax1.set_ylabel("Antigenic dimension 1")
ax1.set_xlabel("Antigenic dimension 2")
fig.savefig("mds-iav-ag-cart.png", dpi=300, bbox_inches='tight')

################# now do some Bayesian MDS ################

fit = sm.sampling(data=data, chains=1, iter=1000, warmup=500)
la = fit.extract(permuted=True)
Xs = la['Xcr']
Xts = [np.array([Xs[i].transpose()[j] for i in range(len(Xs))])
       for j in range(len(nodes))]
meanXts = [np.mean(Xt, axis=0) for Xt in Xts]
ells = [mkEllipse(Xt, scale=1) for Xt in Xts]

fig = plt.figure(figsize=(5,8))
ax1 = fig.add_subplot(111, aspect='equal')

meanxs = [meanXt[0] for meanXt in meanXts]
meanys = [meanXt[1] for meanXt in meanXts]
meanxs_ag = [m for m, n in zip(meanxs, nodes) if labDict[n][:2]=="AG"]
meanys_ag = [m for m, n in zip(meanys, nodes) if labDict[n][:2]=="AG"]
meanxs_sr = [m for m, n in zip(meanxs, nodes) if labDict[n][:2]=="SR"]
meanys_sr = [m for m, n in zip(meanys, nodes) if labDict[n][:2]=="SR"]
year_ag = [getYear(labDict[n]) for n in nodes if labDict[n][:2]=="AG"]

C = ax1.scatter(meanxs_ag, meanys_ag, s=10, c=year_ag, cmap='viridis',
                linewidth=1, edgecolor='k', zorder=2)
ax1.scatter(meanxs_ag, meanys_ag, s=5, c=year_ag, cmap='viridis',
            linewidth=0, zorder=3)
fig.colorbar(C, ax=ax1, shrink=0.3)
ax1.scatter(meanxs_sr, meanys_sr, s=10, c='w', marker='s',
            linewidth=1, edgecolor='k', zorder=4)

## define a colormap to color the ellipses
cmap = matplotlib.cm.get_cmap('viridis')
norm = matplotlib.colors.Normalize(vmin=np.min(year_ag), vmax=np.max(year_ag))

## draw ellipses
for n, ell in enumerate(ells):
    ax1.add_artist(ell)
    ell.set_clip_box(ax1.bbox)
    ell.set_alpha(0.5)
    lab = labDict[n+1]
    if lab[:2]=="AG":
        ell.set_facecolor(cmap(norm(getYear(lab))))
    else:
        ell.set_facecolor("darkgray")
    ell.set_linewidth(0)

## plot individual samples
for Xt in Xts:
    xs = [x[0] for x in Xt]
    ys = [x[1] for x in Xt]
    ax1.scatter(xs, ys, s=0.5, color='lightgray', alpha=0.3, zorder=1)

ax1.set_ylabel("Antigenic dimension 1")
ax1.set_xlabel("Antigenic dimension 2")
fig.savefig("bmds-iav-ag-cart.png", dpi=300, bbox_inches='tight')

Tuesday, 6 February 2018

Computing q-values with C++

When looking for associations between features $i = 1, \ldots, m$ and some trait, it is often necessary to have some sort of multiple-testing correction. A very conservative method is the Bonferroni correction, which controls the family-wise error rate (FWER), but at the cost of many false negatives. This is not desirable when one wants to discover features or associations, and therefore other methods have been developed. One particularly intuitive method is based on the false discovery rate (FDR) and uses so-called q-values, which are (under certain conditions) elegantly analogous to p-values.

False discovery rate

Let $S$ be the number of features called significant, and $F$ the number of false positives among the significant features (i.e. false discoveries). In an article by John D. Storey, the (positive) false discovery rate is defined as $\mathrm{pFDR} := E[F/S \mid S > 0]$. Hence, the pFDR is the expected fraction of false positives among the features that are called significant. The condition $S > 0$ ensures that it is well defined.

In the case of hypothesis testing, one typically has a test statistic $T$, and one wants to test if the null hypothesis is true ($H = 0$), or rather that the alternative hypothesis is true ($H = 1$). The statistical model specifies the distribution of $T \mid H = 0$, and the null hypothesis is rejected when the realization of $T$ falls into a pre-defined significance region $\Gamma$.
When testing multiple features, we typically have a sequence $(T_i, H_i)_{i=1}^m$, here assumed to be identically distributed and independent. The pFDR then depends on $\Gamma$:
$$\mathrm{pFDR}(\Gamma) = E\left[\frac{F(\Gamma)}{S(\Gamma)} \,\Big|\, S(\Gamma) > 0\right],$$
where $F(\Gamma) := \#\{i : T_i \in \Gamma \wedge H_i = 0\}$ and $S(\Gamma) := \#\{i : T_i \in \Gamma\}$. Storey derives that under certain conditions, we can write
$$\mathrm{pFDR}(\Gamma) = P[H = 0 \mid T \in \Gamma] = \frac{E[F(\Gamma)]}{E[S(\Gamma)]}.$$

The q-value

Let $\{\Gamma_\alpha\}_{\alpha=0}^1$ be a nested family of significance regions. That is, $\Gamma_\alpha \subseteq \Gamma_{\alpha'}$ whenever $\alpha \le \alpha'$, and $P[T \in \Gamma_\alpha \mid H = 0] = \alpha$. For instance, if $T \mid H = 0 \sim \mathcal{N}(0,1)$, then we could choose $\Gamma_\alpha = [z_\alpha, \infty)$, where $z_\alpha = \Phi^{-1}(1-\alpha)$, with $\Phi$ the CDF of $\mathcal{N}(0,1)$.
The q-value of a realization $t$ of $T$ is then defined as
$$q(t) = \inf_{\{\Gamma_\alpha : t \in \Gamma_\alpha\}} \mathrm{pFDR}(\Gamma_\alpha).$$
We can now give the above-mentioned analogy between p-values and q-values. The p-value is defined as
$$p(t) = \inf_{\{\Gamma_\alpha : t \in \Gamma_\alpha\}} P[T \in \Gamma_\alpha \mid H = 0],$$
while under the right conditions, the q-value can be written as
$$q(t) = \inf_{\{\Gamma_\alpha : t \in \Gamma_\alpha\}} P[H = 0 \mid T \in \Gamma_\alpha].$$

Computing q-values

In order to compute q-values, given a sequence of p-values, we follow the steps given in a paper by Storey and Tibshirani. In this scenario, the p-value plays the role of the realization $t$ of the statistic $T$. Under the null hypothesis, these p-values are uniformly distributed. As a family of significance regions, we simply take $\Gamma_\alpha = [0, \alpha]$, and write for instance $S(\alpha) := S(\Gamma_\alpha)$.

First, we have to estimate $E[S(\alpha)]$, for which we use $\#\{i : p_i \le \alpha\}$, and we estimate $E[F(\alpha)]$ with $m \hat{\pi}_0 \alpha$, where $\hat{\pi}_0$ is an estimate for $\pi_0 = P[H = 0]$.
The most difficult part of computing q-values is estimating $\pi_0$. An often-used estimate is the average p-value, but we can do a little bit better by making use of the empirical CDF of the p-values. The figure below shows a "typical" histogram of p-values, where the p-values sampled from the alternative distribution are given by the gray bars. The marginal distribution of p-values becomes relatively flat towards the right, where most p-values should come from the null distribution, which is $\mathrm{Uniform}(0,1)$. The right panel of the figure shows a "rotated" empirical CDF of the p-values, i.e. $x \mapsto R(x) = 1 - \mathrm{CDF}(1-x)$, and the fact that large p-values should be uniformly distributed is reflected in the fact that $R(x)$ is a straight line for small $x$. The slope of this straight line is an estimate of $\pi_0$. In the C++ code below, I use GSL to fit a line through $R$ (indicated by the red line in the figure), using a weight function $x \mapsto (1-x)^2$ to give more weight to small values of $x$, thereby mostly ignoring the non-straight part of $R$.
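A weighted least-squares fit of a line through the origin, $R(x) \approx \hat{\pi}_0 x$, with weights $w_k = (1-x_k)^2$ on grid points $x_k$ (this is what the call to gsl_fit_wmul below does) has the closed-form solution
$$\hat{\pi}_0 = \frac{\sum_k w_k\, x_k\, R(x_k)}{\sum_k w_k\, x_k^2}.$$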

In this example, the marginal p-values are sampled from a mixture distribution $\pi_0\,\mathrm{Uniform}(0,1) + (1-\pi_0)\,\mathrm{Beta}(10,1)$, where $\pi_0 = 3/4$.


Now we sort the p-values such that $p_1 \le p_2 \le \cdots \le p_m$, so that $\#\{j : p_j \le p_i\} = i$, and first determine the q-value corresponding to $p_m$: $q_m = \hat{\pi}_0 p_m$. The q-value $q_i$ corresponding to p-value $p_i$ is then computed as $q_i = \min(\hat{\pi}_0 p_i m / i,\, q_{i+1})$. Recently, I had to implement this algorithm in C++. The function is given in the following header file, and accepts a "keyed" list of p-values, where the key is used to identify the feature. The function returns a keyed list of q-values.
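As a small worked example with made-up numbers: take $m = 4$ sorted p-values $(0.001,\, 0.01,\, 0.04,\, 0.5)$ and $\hat{\pi}_0 = 0.75$. Then
$$q_4 = 0.75 \cdot 0.5 = 0.375, \qquad q_3 = \min\!\left(0.75 \cdot 0.04 \cdot \tfrac{4}{3},\, q_4\right) = 0.04,$$
$$q_2 = \min\!\left(0.75 \cdot 0.01 \cdot \tfrac{4}{2},\, q_3\right) = 0.015, \qquad q_1 = \min\!\left(0.75 \cdot 0.001 \cdot \tfrac{4}{1},\, q_2\right) = 0.003.$$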

/** qvalues.hpp
 * transform a keyed list of p-values into a keyed list of q-values
 */
#ifndef QVALUES_HPP_
#define QVALUES_HPP_

#include <map>
#include <list>
#include <vector>
#include <algorithm> // min, count_if, for_each
#include <cmath> // sqrt
#include <stdexcept>
#include <gsl/gsl_errno.h> // GSL_SUCCESS
#include <gsl/gsl_fit.h> // gsl_fit_wmul

template<class KeyType>
std::map<KeyType, double> computeQValues(const std::map<KeyType, double> & pValues) {
    int m = pValues.size();
    // sort keys by p-values
    std::list<KeyType> keys;
    for ( auto kp : pValues ) {
        keys.push_back(kp.first);
    }
    keys.sort([&](KeyType lk, KeyType rk){return pValues.at(lk) < pValues.at(rk);});
    // make a partition of the interval [0, 1]
    int n = int(2*std::sqrt(m));
    double mesh = 1.0 / n;
    std::vector<double> xs(n, 0.0);
    for ( int i = 0; i < n; ++i ) {
        xs[i] = (i+1) * mesh;
    }
    // make a rotated CDF, i.e.: rcdf(x) = 1-cdf(1-x)
    std::vector<double> rcdf(n, 0.0);
    for ( int i = 0; i < n; ++i ) {
        auto pred = [&](std::pair<KeyType, double> pair){return pair.second < 1.0 - xs[i];};
        rcdf[i] = 1.0 - std::count_if(pValues.begin(), pValues.end(), pred) / double(m);
    }
    // make a weight vector
    std::vector<double> ws = xs;
    std::for_each(ws.begin(), ws.end(), [](double & x){x = (1-x)*(1-x);});
    // fit a line through the rotated cdf
    double hatPi0, cov11, sumsq; // passed by ref to GSL function
    int err_code_fit = gsl_fit_wmul(xs.data(), 1, ws.data(), 1, rcdf.data(),
            1, n, &hatPi0, &cov11, &sumsq);
    if ( err_code_fit != GSL_SUCCESS ) {
        throw std::runtime_error("unable to estimate pi_0");
    }
    // compute the q-values, by starting with the largest p-value
    std::map<KeyType, double> qValues; // to-be-returned
    double q_previous = 1.0;
    keys.reverse();
    int i = m; // index in the for loop over the keys
    for ( auto k : keys ) {
        q_previous = std::min(q_previous, (m * hatPi0 * pValues.at(k)) / (i--));
        // postfix i-- returns the old value
        qValues[k] = q_previous;
    }
    return qValues;
}

#endif


The following code gives an example of how to use qvalues.hpp, and can be used for testing:

/** qvalues_test.cpp
 * program for testing qvalues.hpp
 * compile with
 * g++ -O3 -std=c++11 qvalues_test.cpp -lgsl -lgslcblas -o qvaltest
 * then choose a seed for the random number generator (e.g. 1728) and run
 * ./qvaltest 1728
 */
#include <iostream>
#include <iomanip> // for printing a decent table
#include <cstdlib> // atoi
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include "qvalues.hpp"

int main(int argc, char* argv[]) {
    // choose a random seed
    unsigned long seed = 144169;
    if ( argc > 1 ) {
        seed = atoi(argv[1]);
    }
    // initialize a random number generator
    gsl_rng* rng = gsl_rng_alloc(gsl_rng_taus);
    gsl_rng_set(rng, seed);
    // choose the probability of a null result
    double pi0 = 0.75;
    // and the number of tests
    int m = 1000;
    /* p-values from the alternative distribution are assumed
     * to be Beta distributed with parameters a and b
     */
    double a = 1;
    double b = 20;
    // create p-values from a mixture distribution
    std::map<int, double> pValues;
    std::map<int, bool> Hs;
    for ( int i = 0; i < m; ++i ) {
        /* H = 0 means null hypothesis is true,
         * H = 1 means alternative hypothesis is true
         */
        bool H = gsl_ran_bernoulli(rng, 1-pi0);
        Hs.emplace(i, H);
        // p-values from the null model have a flat (uniform) distribution
        double p = H ? gsl_ran_beta(rng, a, b) : gsl_ran_flat(rng, 0.0, 1.0);
        pValues.emplace(i, p);
    }
    // transform p-values into q-values
    auto qValues = computeQValues(pValues);
    // call features with a q-value < 0.2 significant.
    double qThreshold = 0.2;
    int fds(0), tds(0), fns(0), tns(0); // count false/true discoveries/negatives
    for ( auto pair : pValues ) {
        int i = pair.first;
        double q = qValues.at(i);
        bool H = Hs[i];
        if ( q < qThreshold ) { // call significant
            if ( H ) tds++;
            else fds++;
        } else { // call negative
            if ( H ) fns++;
            else tns++;
        }
    }
    // compute the "realized" false discovery rate
    double FDR = double(fds) / (tds + fds);
    // print some results
    std::cout << std::setw(20) << "true"
              << std::setw(6) << "false" << std::endl
              << std::setw(14) << "discoveries:" << std::setw(6) << tds
              << std::setw(6) << fds << std::endl
              << std::setw(14) << "negatives:" << std::setw(6) << tns
              << std::setw(6) << fns << std::endl;
    std::cout << "realized FDR: " << FDR << std::endl;
    // delete the random number generator
    gsl_rng_free(rng);
    return 0;
}

After compiling and running the program, the result should be something like:
$ g++ -O3 -std=c++11 qvalues_test.cpp -lgsl -lgslcblas -o qvaltest
$ ./qvaltest 1728
                true false
  discoveries:   143    32
    negatives:   712   113
realized FDR: 0.182857