Multidimensional scaling (MDS) is a data visualization technique in which the dimension of the data is reduced in a non-linear way. The data is represented as an $N \times N$ distance matrix $(d_{ij})_{ij}$, and $N$ points $x_i$ in a $D$-dimensional space (typically $D = 2$) are chosen such that the Euclidean distances $\|x_i - x_j\|$ resemble the input distances $d_{ij}$ as well as possible.
In metric MDS, an objective function $E(x) = \sum_{1 \le i < j \le N} (d_{ij} - \|x_i - x_j\|)^2$ is defined that needs to be minimized. Different flavors of MDS define this objective function differently. To minimize the objective function, one can use, for instance, the conjugate gradient method. This method requires that one calculates the gradient $\nabla E$ of the objective function. Of course, this is not so difficult in the case of metric MDS, but more complicated objective functions might require more effort. Enter Stan.
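To make this concrete, here is a minimal sketch of metric MDS in plain NumPy/SciPy, assuming a complete distance matrix; the function names and toy data are my own, not part of any MDS library. It minimizes $E$ with the conjugate gradient method, which needs the hand-coded gradient $\nabla E$:

import numpy as np
from scipy.optimize import minimize

def stress(x_flat, d, N, D):
    ## E(x) = sum_{i<j} (d_ij - ||x_i - x_j||)^2
    x = x_flat.reshape(N, D)
    return sum((d[i, j] - np.linalg.norm(x[i] - x[j]))**2
               for i in range(N) for j in range(i+1, N))

def stress_grad(x_flat, d, N, D):
    ## dE/dx_i = -2 sum_{j != i} (d_ij - r_ij) (x_i - x_j) / r_ij
    x = x_flat.reshape(N, D)
    g = np.zeros_like(x)
    for i in range(N):
        for j in range(N):
            if i != j:
                diff = x[i] - x[j]
                r = np.linalg.norm(diff)
                g[i] += -2 * (d[i, j] - r) * diff / r
    return g.ravel()

## toy example: recover 4 points in the plane from their distances
rng = np.random.default_rng(1)
y = rng.normal(size=(4, 2))
d = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
res = minimize(stress, rng.normal(size=8), args=(d, 4, 2),
               jac=stress_grad, method='CG')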
Stan uses automatic differentiation for Hamiltonian Monte Carlo, but Stan can also be used for maximum likelihood. Hence, if we can formulate the MDS problem in terms of a likelihood function, we can let Stan do all the work. The parameters of the model are the $x_i$; the data is given by the distances $d_{ij}$. If we assume that, given the parameters, the data is distributed as $d_{ij} \sim \mathcal{N}(\|x_i - x_j\|, \sigma^2)$, then maximizing the (log) likelihood is equivalent to minimizing the function $E$. The parameter $\sigma^2$ is merely a nuisance parameter that needs to be estimated as well.
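To spell out the equivalence: assuming for simplicity that all pairwise distances are observed, the log-likelihood of this model is

$$\log L(x, \sigma) = \sum_{1 \le i < j \le N} \log \mathcal{N}\!\left(d_{ij} \mid \|x_i - x_j\|, \sigma^2\right) = -\frac{E(x)}{2\sigma^2} - \binom{N}{2} \log\!\left(\sigma \sqrt{2\pi}\right),$$

so for every fixed $\sigma$, maximizing the likelihood over $x$ amounts to minimizing $E(x)$.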
An implementation of MDS in the Stan programming language
Implementing MDS in Stan is fairly straightforward, but there are a few snags that we should be aware of. First, if $x$ solves the MDS problem, then so does any Euclidean transformation of $x$. Hence, the model as stated above has too many parameters. We solve this by fixing the first point at the origin, restricting the second point to a 1-dimensional half-space, the third point to a 2-dimensional half-space, et cetera. The last $N - D - 1$ points are unrestricted. In Stan, we can accomplish this by declaring a cholesky_factor_cov matrix: a lower-triangular matrix with positive diagonal entries. We then use the transformed parameters block to concatenate the points together into a single matrix.
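As a quick sanity check on this gauge freedom, a small NumPy sketch (toy data and names of my own) confirms that rotating and translating a configuration leaves all pairwise distances, and hence $E$, unchanged:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))                     ## a configuration of 5 points
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],  ## a rotation ...
              [np.sin(theta),  np.cos(theta)]])
t = np.array([3.0, -1.0])                       ## ... and a translation
x2 = x @ R.T + t

pdist = lambda z: np.linalg.norm(z[:, None] - z[None, :], axis=-1)
print(np.allclose(pdist(x), pdist(x2)))         ## True: distances unchanged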
Secondly, the data that we use in the example below is highly censored. Many of the distances are missing, and some are right censored. In such a case MDS can be used to infer the missing distances, and not merely visualize the data. The data that is passed to Stan, therefore, is a list of edges, a list of distances, and a list of codes that determine the type of censoring.
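For reference, the likelihood contribution for each censoring code can be sketched in Python with scipy.stats.norm; this mirrors what the Stan model below does with normal_lcdf and normal_lccdf (the function name log_contrib is mine):

from scipy.stats import norm

def log_contrib(d_obs, code, dist, sigma):
    if code == 0:    ## uncensored: normal log density
        return norm.logpdf(d_obs, loc=dist, scale=sigma)
    elif code == 1:  ## left-censored: log P(D <= d_obs)
        return norm.logcdf(d_obs, loc=dist, scale=sigma)
    elif code == 2:  ## right-censored: log P(D >= d_obs)
        return norm.logsf(d_obs, loc=dist, scale=sigma)
    else:            ## missing: no contribution
        return 0.0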
Thirdly, as the title of this post suggests, we will use Stan to do some sort of Bayesian MDS. In this case, we sample a collection of "maps" $x$ from a posterior distribution, which gives information about the location of each point, but also about the uncertainty of this location. Here, the fact that we restrict the first $D + 1$ points comes back to bite us, as the uncertainty of these points will be different from that of the unrestricted points. Furthermore, it might be hard to compare the individual maps to one another, and for instance compute sensible mean locations of the points, as some maps may be "twisted" more than others. Therefore, we use the generated quantities block to center and rotate (cf. PCA) the sampled maps, as sketched below.
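For illustration, this post-processing can be sketched in NumPy (my own function name; X has shape D x N as in the Stan model below):

import numpy as np

def center_and_rotate(X):
    ## center the map
    Xc = X - X.mean(axis=1, keepdims=True)
    ## eigendecomposition of the D x D scatter matrix Xc Xc'
    eivals, U = np.linalg.eigh(Xc @ Xc.T)
    ## rotate the map onto the principal axes of its point cloud
    return U.T @ Xc

Note that eigenvectors are only determined up to sign, so individual maps may still be reflected relative to one another.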
data {
  int<lower=1> N; // number of nodes
  int<lower=0, upper=(N*(N-1))/2> E; // number of edges
  int<lower=1, upper=N-1> D; // target dimension
  int<lower=1, upper=N> edges[E,2]; // a list of edges (i1, i2)
  vector<lower=0>[E] distances; // observed distances
  int<lower=0, upper=3> censoring[E]; // censoring of distances
  /* 0: uncensored,
   * 1: left-censored,
   * 2: right-censored,
   * 3: missing
   */
}
parameters {
  /* the first point is fixed at the origin (0)
   * the next D points are stored in a lower triangular matrix (X1)
   * the remaining points are given by a D x (N-D-1) matrix (X2)
   */
  cholesky_factor_cov[D] X1;
  matrix[D, N-D-1] X2;
  real<lower=0> sigma; // nuisance parameter
}
transformed parameters {
  // put all coordinates in a single matrix X
  matrix[D, N] X; // [0, X1', X2]
  X = append_col(rep_vector(0, D), append_col(X1', X2));
}
model {
  // try to keep sigma small
  sigma ~ exponential(1);
  // calculate the current distances (between the Xs)
  for ( e in 1:E ) {
    int i1; int i2;
    real dist;
    // compute the distance
    i1 = edges[e][1];
    i2 = edges[e][2];
    dist = distance(col(X, i1), col(X, i2));
    // likelihood
    if ( censoring[e] == 0 ) { // uncensored
      distances[e] ~ normal(dist, sigma);
    } else if ( censoring[e] == 1 ) { // left-censored
      target += normal_lcdf(distances[e] | dist, sigma);
    } else if ( censoring[e] == 2 ) { // right-censored
      target += normal_lccdf(distances[e] | dist, sigma);
    } else if ( censoring[e] == 3 ) { // missing
      // do nothing
    }
  }
}
generated quantities {
  vector[D] meanX; // the means of each MCMC sample
  matrix[D, N] Xc; // centered positions
  matrix[D, N] Xcr; // centered and rotated positions
  // center X
  for ( i in 1:D ) {
    meanX[i] = mean(row(X, i));
  }
  Xc = X - rep_matrix(meanX, N);
  // rotate Xc
  Xcr = eigenvectors_sym(tcrossprod(Xc))' * Xc;
}
Example: Antigenic cartography of the influenza virus
An interesting application of MDS is antigenic cartography of the influenza virus. The influenza virus circumvents human antibody responses by continually evolving its surface proteins, in particular hemagglutinin (HA). This is known as antigenic drift. In order to decide whether flu vaccines need to be updated, the hemagglutination inhibition (HI) assay is used to determine if the induced antibody response is still effective against future strains. The titers measured in the HI assay can be used to define "distances" between antisera and antigens. Using MDS, the antisera and antigens can be drawn into a "map" that shows the antigenic drift of the virus. This was done by Smith et al. in 2004. Conveniently, the data used for their map is available online. This table gives HI titers $H_{ij}$ of antigen $i$ and antiserum $j$. A small titer corresponds to a large distance; the distances are defined as $d_{ij} = \log_2(H_{\max,j}) - \log_2(H_{ij})$, where $H_{\max,j} = \max_k H_{kj}$. As an example, I recreated their antigenic map using the Stan model above and the Python script below.

The white squares denote the relative positions of the antisera in the "antigenic space", while the colored circles represent the antigens. The colors map to the years in which the influenza strains were isolated.
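As a minimal sketch of this titer-to-distance conversion (with a made-up titer table), the columnwise maximum determines the reference titer for each antiserum:

import numpy as np

## toy HI table: rows are antigens i, columns are antisera j
H = np.array([[1280.,  640.,   80.],
              [ 320., 1280.,  160.],
              [  40.,  160.,  640.]])
logH = np.log2(H)
## d_ij = log2(H_max,j) - log2(H_ij), computed per antiserum column j
d = logH.max(axis=0) - logH

The real table also contains missing ('*') and thresholded ('<') entries, which the script below maps to the censoring codes of the Stan model.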
Bayesian multidimensional scaling
For antigenic cartography of IAV, Bayesian MDS was introduced by Bedford et al., who used multiple HI assay results per antigen/antiserum pair to incorporate the uncertainty of these measurements in their antigenic map. Moreover, they were able to use genetic and temporal information about the antigens (i.e. the RNA sequences of HA and their isolation dates) to inform the positions of the antigens and antisera on the map. We will not go that far in this post, but since we have already formulated the MDS algorithm in Stan, we might as well make a "Bayesian" antigenic map. This can give some insight into the uncertainty of the positions of the antigens and antisera, not unlike the confidence areas (the ellipsoid shapes) drawn by Smith et al. The result is given by the following figure.

Again, squares indicate antisera and colored circles the antigens. All the individual MCMC samples are represented by the grey dots. The MCMC samples for each antigen or antiserum are used to draw a two-dimensional error bar (i.e. ellipse) around the mean location.
A Python script for parsing the HI titer data, compiling and running the Stan model, and drawing the maps is added below. For it to work, you will need to download the mds_model.stan file and make a CSV file called baselinemap.csv containing the HI table.
import csv
import pystan
import pickle
import numpy as np
import hashlib
import matplotlib
import matplotlib.pyplot as plt

################ some auxiliary functions ###############

def CachedStanModel(model_code, model_name=None):
    """
    Function to cache compiled Stan models. See:
    https://pystan.readthedocs.io/en/latest/
    """
    code_hash = hashlib.md5(model_code.encode('ascii')).hexdigest()
    if model_name is None:
        cache_fn = 'cached-model-{}.pkl'.format(code_hash)
    else:
        cache_fn = 'cached-{}-{}.pkl'.format(model_name, code_hash)
    try:
        sm = pickle.load(open(cache_fn, 'rb'))
    except:
        sm = pystan.StanModel(model_code=model_code)
        with open(cache_fn, 'wb') as f:
            pickle.dump(sm, f)
    else:
        print("Using cached StanModel")
    return sm

def mkEllipse(Xt, scale=1):
    """
    Function to make an ellipse that covers a point cloud.
    The scale parameter can be used to rescale the ellipse.
    """
    ## for each point cloud, do a 'PCA'
    meanXt = np.mean(Xt, axis=0)
    ## center the cloud to do eigen decomposition
    Xtprime = np.array([X - np.mean(Xt, axis=0) for X in Xt])
    C = np.dot(Xtprime.transpose(), Xtprime) / Xt.shape[0]
    eivals, U = np.linalg.eigh(C)
    ## compute angle (in degrees)
    angle = np.arccos(U[0,0]) * np.sign(U[0,1]) * 360/(2*np.pi)
    height = scale * 2 * np.sqrt(eivals[1])
    width = scale * 2 * np.sqrt(eivals[0])
    ell = matplotlib.patches.Ellipse(xy=meanXt,
            width=width, height=height, angle=angle)
    return ell

def getYear(lab):
    """
    Single purpose function to get sampling year
    from antigen/antiserum label.
    """
    if lab[-3:] == "REC":
        yr_str = lab[-5:-3]
    else:
        yr_str = lab[-2:]
    yr = int(yr_str)
    ## a bodge for the Y2K bug
    return yr + 2000 if yr < 50 else yr + 1900

############### compile Stan model and HI data ###############

## open the text file containing the Stan model and compile it.
with open("mds_model.stan", 'r') as f:
    mds_model = f.read()
sm = CachedStanModel(model_code=mds_model)

## import influenza HI data
with open("baselinemap.csv", 'r') as f:
    ## NB: this depends on how you saved the HI data
    reader = csv.reader(f, delimiter=';')
    table = [row for row in reader]

## the header starts at the first field
## (for you, this could be the second field)
sr_labs = ["SR_" + elt for elt in table[0] if elt != '']
## the first column contains the antigen labels
ag_labs = ["AG_" + row[0] for row in table[1:]]
values = [row[1:] for row in table[1:]]

labs = sr_labs + ag_labs
nodes = list(range(1, len(labs)+1))
labDict = {n : l for n, l in zip(nodes, labs)}
nodeDict = {l : n for n, l in zip(nodes, labs)}

## uncensored titers
uTiterDict = {
    (nodeDict[l1], nodeDict[l2]) : np.log2(float(values[i2][i1]))
    for i1, l1 in enumerate(sr_labs)
    for i2, l2 in enumerate(ag_labs)
    if values[i2][i1] != '*' and values[i2][i1][0] != '<'
}
## left-censored titers (i.e. right-censored distances)
cTiterDict = {
    (nodeDict[l1], nodeDict[l2]) : np.log2(float(values[i2][i1][1:]))
    for i1, l1 in enumerate(sr_labs)
    for i2, l2 in enumerate(ag_labs)
    if values[i2][i1] != '*' and values[i2][i1][0] == '<'
}

edges = list(uTiterDict.keys()) + list(cTiterDict.keys())
edges.sort()
titerDict = {
    e : w for e, w in list(uTiterDict.items()) + list(cTiterDict.items())
}
## censoring: 0 means uncensored, 2 means right-censored
censorDict = {e : 0 for e in uTiterDict.keys()}
censorDict.update({e : 2 for e in cTiterDict.keys()})

## find the maximum titer for each antiserum
maxTiterDict = {
    l : np.max([logH for (i1, i2), logH in uTiterDict.items() if labDict[i1]==l])
    for l in sr_labs
}
## the distance d_{ij} is defined as log_2(H_{max,j}) - log_2(H_{i,j})
distanceDict = {
    e : maxTiterDict[labDict[e[0]]] - w for e, w in titerDict.items()
}

distances = [distanceDict[e] for e in edges]
censoring = [censorDict[e] for e in edges]

data = {
    'D' : 2,
    'E' : len(edges),
    'N' : len(nodes),
    'distances' : distances,
    'censoring' : censoring,
    'edges' : edges
}

############# use Stan to minimize the MDS error ############

fit_opt = sm.optimizing(data=data)
Xs_opt = fit_opt["Xcr"]
optXts = Xs_opt.transpose()

fig = plt.figure(figsize=(5,8))
ax1 = fig.add_subplot(111, aspect='equal')

optxs = [X[0] for X in optXts]
optys = [X[1] for X in optXts]
optxs_ag = [m for m, n in zip(optxs, nodes) if labDict[n][:2]=="AG"]
optys_ag = [m for m, n in zip(optys, nodes) if labDict[n][:2]=="AG"]
optxs_sr = [m for m, n in zip(optxs, nodes) if labDict[n][:2]=="SR"]
optys_sr = [m for m, n in zip(optys, nodes) if labDict[n][:2]=="SR"]
year_ag = [getYear(labDict[n]) for n in nodes if labDict[n][:2]=="AG"]

C = ax1.scatter(optxs_ag, optys_ag, s=10, c=year_ag, cmap='viridis',
                linewidth=1, edgecolor='k', zorder=2)
ax1.scatter(optxs_ag, optys_ag, s=5, c=year_ag, cmap='viridis',
            linewidth=0, zorder=3)
fig.colorbar(C, ax=ax1, shrink=0.3)
ax1.scatter(optxs_sr, optys_sr, s=10, c='w', marker='s',
            linewidth=1, edgecolor='k', zorder=4)

ax1.set_ylabel("Antigenic dimension 1")
ax1.set_xlabel("Antigenic dimension 2")

fig.savefig("mds-iav-ag-cart.png", dpi=300, bbox_inches='tight')

################# now do some Bayesian MDS ################

fit = sm.sampling(data=data, chains=1, iter=1000, warmup=500)
la = fit.extract(permuted=True)
Xs = la['Xcr']
Xts = [np.array([Xs[i].transpose()[j] for i in range(len(Xs))])
       for j in range(len(nodes))]
meanXts = [np.mean(Xt, axis=0) for Xt in Xts]
ells = [mkEllipse(Xt, scale=1) for Xt in Xts]

fig = plt.figure(figsize=(5,8))
ax1 = fig.add_subplot(111, aspect='equal')

meanxs = [meanXt[0] for meanXt in meanXts]
meanys = [meanXt[1] for meanXt in meanXts]
meanxs_ag = [m for m, n in zip(meanxs, nodes) if labDict[n][:2]=="AG"]
meanys_ag = [m for m, n in zip(meanys, nodes) if labDict[n][:2]=="AG"]
meanxs_sr = [m for m, n in zip(meanxs, nodes) if labDict[n][:2]=="SR"]
meanys_sr = [m for m, n in zip(meanys, nodes) if labDict[n][:2]=="SR"]
year_ag = [getYear(labDict[n]) for n in nodes if labDict[n][:2]=="AG"]

C = ax1.scatter(meanxs_ag, meanys_ag, s=10, c=year_ag, cmap='viridis',
                linewidth=1, edgecolor='k', zorder=2)
ax1.scatter(meanxs_ag, meanys_ag, s=5, c=year_ag, cmap='viridis',
            linewidth=0, zorder=3)
fig.colorbar(C, ax=ax1, shrink=0.3)
ax1.scatter(meanxs_sr, meanys_sr, s=10, c='w', marker='s',
            linewidth=1, edgecolor='k', zorder=4)

## define a colormap to color the ellipses
cmap = matplotlib.cm.get_cmap('viridis')
norm = matplotlib.colors.Normalize(vmin=np.min(year_ag), vmax=np.max(year_ag))

## draw ellipses
for n, ell in enumerate(ells):
    ax1.add_artist(ell)
    ell.set_clip_box(ax1.bbox)
    ell.set_alpha(0.5)
    lab = labDict[n+1]
    if lab[:2]=="AG":
        ell.set_facecolor(cmap(norm(getYear(lab))))
    else:
        ell.set_facecolor("darkgray")
    ell.set_linewidth(0)

## plot individual samples
for Xt in Xts:
    xs = [x[0] for x in Xt]
    ys = [x[1] for x in Xt]
    ax1.scatter(xs, ys, s=0.5, color='lightgray', alpha=0.3, zorder=1)

ax1.set_ylabel("Antigenic dimension 1")
ax1.set_xlabel("Antigenic dimension 2")

fig.savefig("bmds-iav-ag-cart.png", dpi=300, bbox_inches='tight')