Article

Comparing Markov Chain Samplers for Molecular Simulation

Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
* Author to whom correspondence should be addressed.
Current address: School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ 85287, USA.
Entropy 2017, 19(10), 561; https://doi.org/10.3390/e19100561
Submission received: 6 September 2017 / Revised: 17 October 2017 / Accepted: 19 October 2017 / Published: 21 October 2017
(This article belongs to the Special Issue Understanding Molecular Dynamics via Stochastic Processes)

Abstract
Markov chain Monte Carlo sampling propagators, including numerical integrators for stochastic dynamics, are central to the calculation of thermodynamic quantities and determination of structure for molecular systems. Efficiency is paramount, and to a great extent, this is determined by the integrated autocorrelation time (IAcT). This quantity varies depending on the observable that is being estimated. It is suggested that it is the maximum of the IAcT over all observables that is the relevant metric. Reviewed here is a method for estimating this quantity. For reversible propagators (which are those that satisfy detailed balance), the maximum IAcT is determined by the spectral gap in the forward transfer operator, but for irreversible propagators, the maximum IAcT can be far less than, or far greater than, what might be inferred from the spectral gap. This is consistent with recent theoretical results (not to mention past practical experience) suggesting that irreversible propagators generally perform better, if not much better, than reversible ones. Typical irreversible propagators have a parameter controlling the mix of ballistic and diffusive movement. To gain insight into the effect of the damping parameter for Langevin dynamics, its optimal value is obtained here for a multidimensional quadratic potential energy function.

1. Summary

Thermodynamic and structural properties of a molecular system can be precisely defined as ensemble-dependent expectations of functions of its configuration. The calculation of such expectations is feasible only with the use of Markov chain Monte Carlo (MCMC) methods or approximations thereof. Considered here are sampling propagators that do not compromise the Markov property. Included are unbiased samplers that conditionally accept proposed moves and biased samplers that unconditionally accept such moves, in particular, discretized stochastic dynamics. Many such sampling propagators are proposed in the literature, and, in virtually all cases, experiments are conducted to substantiate claims of superiority. Too often, though, a good metric is not used to measure the computational cost of a propagator. The aim of this article is threefold: first, to explore some practicalities related to measuring the efficiency of a propagator; second, to highlight the superior efficiency of irreversible propagators, namely, those that do not satisfy detailed balance; and third, to provide some insight into the optimal mix of diffusive and ballistic movement for Langevin dynamics.
Let $\rho_Q(q)$, $q = [q_1, q_2, \ldots, q_\nu]^T$, denote the probability density function corresponding to the ensemble of interest. This function is assumed to be known only up to a multiplicative factor. An observable is an expectation $E[u(Q)] = \int u(q)\,\rho_Q(q)\,dq$ for some “preobservable” $u(q)$. This can be estimated by the mean $\bar U_N = (1/N)\sum_{n=0}^{N-1} u(Q_n)$, where the random values $Q_n$ are obtained from a Markov chain $Q_0 \to Q_1 \to \cdots \to Q_{N-1}$ that samples from $\rho_Q(q)$. Note the use here of upper case to denote random variables.
In practice, sampling performance is improved if the configuration variables $q$ are augmented with auxiliary variables $p$, e.g., momenta, yielding phase space variables $z = (q, p)$. The probability density is extended to $\rho(q,p)$ so that
\[
\int \rho(q,p)\,dp = \rho_Q(q),
\]
and an MCMC scheme is constructed to produce a chain $Z_0 \to Z_1 \to \cdots \to Z_{N-1}$, where $Z = (Q, P)$.
The samples from the chain tend to be highly correlated, and this greatly reduces the convergence rate as $N \to +\infty$. As explained in Section 2, the variance of an estimate for $E[u(Z)]$ is
\[
\operatorname{Var}[\bar U_N] = \frac{\tau}{N}\operatorname{Var}[u(Z)] + O\!\left(\frac{1}{N^2}\right), \tag{1}
\]
where $\tau$ is the integrated autocorrelation time (IAcT) for $u(z)$. The effective sample size is therefore $N/\tau$, and the appropriate metric for evaluating a propagator is the effective sample size divided by the computing time.
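To make the role of $\tau$ concrete, here is a minimal numerical illustration (not from the original article, with assumed parameter values): for an AR(1) chain that samples a standard Gaussian, the IAcT of the preobservable $u(q) = q$ is known in closed form, so Equation (1) can be checked directly.

```python
import numpy as np

# Illustrative sketch: check Var[U_bar_N] ~ (tau/N) Var[u(Z)] for the AR(1)
# propagator Q_n = a*Q_{n-1} + sqrt(1-a^2)*R_n, which samples N(0,1) and has
# IAcT tau = (1+a)/(1-a) for the preobservable u(q) = q.
rng = np.random.default_rng(0)
a, N, n_chains = 0.9, 10_000, 500
tau = (1 + a) / (1 - a)                     # = 19 for a = 0.9

q = rng.standard_normal(n_chains)           # start all chains in equilibrium
running_sum = np.zeros(n_chains)
for _ in range(N):
    q = a * q + np.sqrt(1 - a**2) * rng.standard_normal(n_chains)
    running_sum += q
u_bar = running_sum / N

print("empirical Var[U_bar_N]:", u_bar.var())
print("tau/N * Var[u(Z)]     :", tau / N)   # Var[u(Z)] = 1 here
```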
In a great many practical simulations, the effective sample size is probably close to zero. One can disagree on the significance of such simulations [1]. In any case, for the comparison of sampling algorithms, it is possible to choose molecular systems, restrained if necessary, for which it is feasible to attain a decent effective sample size.
Often, the spectral gap is cited as the relevant quantity. To understand its role, it is helpful to express ideas in a direct way as in Refs. [2,3]. As detailed in Section 2, introduce a forward transfer operator $\mathcal{F}$ to express the ratio $\rho_n/\rho$ in terms of $\rho_{n-1}/\rho$, where $\rho_n(z)$ is the probability density for $Z_n$. Let $\mathcal{F}_0 = \mathcal{F} - \mathcal{E}$, where $\mathcal{E}u$ denotes the function with constant value $E[u(Z)]$. Assume that the operator $\mathcal{F}_0$ has its spectrum strictly inside the unit circle, as it does in practice. The error in $(1/N)\sum_{n=0}^{N-1}\rho_n(z)$ can be shown [1] to be “proportional to” $(1-\mathcal{F}_0)^{-1}$ and therefore to the reciprocal of the spectral gap $|1-\lambda_2|$, where $\lambda_2$ is a nonunit eigenvalue of $\mathcal{F}$ nearest 1.
Estimates of the IAcT are obtained by summing the terms of the autocorrelation function, which is constructed from autocovariances of increasing time lags normalized by the (lag zero) covariance. Each term contributes a roughly equal statistical error but a signal that decays as the lag time increases. Therefore, in practice, the terms are weighted by a function called a lag window. The lag window must be tailored to the autocorrelation function, and choosing a suitable lag window is very difficult, as mentioned in Section 2.1.
Reliable estimates of the IAcT are impractical in general. Section 3 reviews the concept of quasi-reliability, which aims to enforce a necessary condition for reliable estimates. Informally, the goal is to ensure good coverage of those “parts” of phase space that have been explored, to reduce the possibility of missing an opening to an unexplored part of phase space. More precisely, for an arbitrary subset of phase space, we ask that the proportion of samples in that subset differ from its expectation by no more than some tolerance $\mathrm{tol}$, with, say, 95% confidence. This is shown to be true if the IAcT $\tau$ for any preobservable $u(z)$ satisfies $\tau \le \mathrm{tol}^2 N$. The maximum IAcT $\tau_{\max}$ is the greatest eigenvalue of
\[
\mathcal{G} = 1 - \mathcal{E} + \mathcal{F}_0(1-\mathcal{F}_0)^{-1} + \big(\mathcal{F}_0(1-\mathcal{F}_0)^{-1}\big)^\dagger, \tag{2}
\]
where $\dagger$ denotes the adjoint with respect to the inner product
\[
\langle v, u\rangle = \int \overline{v(z)}\,u(z)\,\rho(z)\,dz.
\]
For a reversible propagator, where $\mathcal{F}^\dagger = \mathcal{F}$, $\tau_{\max}$ is 1 less than twice the reciprocal of the spectral gap. However, for an irreversible propagator, it can be much smaller, as demonstrated by a simple example in Section 4.1, or larger, as in Section 5. As a practical algorithm, it is suggested to estimate the maximum IAcT by discretizing the space of preobservables. Consideration of a quadratic energy potential suggests using a linear combination of phase space variables (and possibly quadratic terms in these variables).
The idea of seeking the preobservable that maximizes the IAcT is suggested already in Ref. [4], which considers a set of indicator functions as preobservables and uses the greatest IAcT of these to assess sampling thoroughness. In general, maximizing over a linear combination of preobservables can lead to a much larger result than taking the maximum of them individually, due to correlations that might exist between different preobservables. This does not necessarily apply, however, to those considered in Refs. [1,4].
Section 3.1 notes that typical irreversible propagators, termed “quasi-reversible” here, have a forward transfer operator $\mathcal{F} = \mathcal{R}\bar{\mathcal{F}}$, where each of $\bar{\mathcal{F}}$ and $\mathcal{R}$ is reversible and $\mathcal{R}^2 = 1$. For such propagators, the estimation of $\tau_{\max}$ simplifies somewhat.
Theoretical results [5] indicate that adding irreversibility reduces the autocorrelation times of observables. Section 4 gives a couple of very elementary examples illustrating the dramatic increase in $\tau_{\max}$ if an irreversible propagator $\mathcal{F}$ is replaced by its reversible “counterpart” $\frac{1}{2}(\mathcal{F} + \mathcal{F}^\dagger)$.
Discretized Langevin dynamics is a particularly effective general-purpose propagator. Unfortunately, one must specify a value for the damping coefficient $\gamma$. Section 5 analyzes $\tau_{\max}$ for a quadratic potential and obtains an optimal value for the coefficient, namely, the value $(3/8)^{1/2} \approx 0.612$ times the critical damping coefficient for the lowest frequency $\omega_1$. This value is such that the lowest-frequency mode is moderately underdamped, with higher-frequency modes increasingly underdamped.

2. Preliminaries

Assuming that $Z_0$ has probability density $\rho(z)$, the variance of the estimate $\bar U_N$ is exactly
\[
\operatorname{Var}[\bar U_N] = \frac{1}{N}\operatorname{Var}[u(Z)]\left(1 + 2\sum_{k=1}^{N-1}\Big(1 - \frac{k}{N}\Big)\frac{C(k)}{C(0)}\right),
\]
where the autocovariances
\[
C(k) = E\Big[\big(u(Z_0) - E[u(Z_0)]\big)\big(u(Z_k) - E[u(Z_k)]\big)\Big].
\]
The limit $N \to +\infty$ gives Equation (1), where the integrated autocorrelation time
\[
\tau = 1 + 2\sum_{k=1}^{+\infty}\frac{C(k)}{C(0)}. \tag{3}
\]
As an example of augmenting configuration space, consider $\rho(q,p) \propto \exp\!\big(-\beta\big(V(q) + \tfrac{1}{2}p^TM^{-1}p\big)\big)$, where $p = [p_1, p_2, \ldots, p_\nu]^T$. A good propagator for this is the BAOAB integrator [6] for Langevin dynamics, whose equations are
\[
dQ_t = M^{-1}P_t\,dt, \qquad dP_t = F(Q_t)\,dt - \gamma P_t\,dt + \sqrt{2\gamma/\beta}\;M^{1/2}\,dW_t, \tag{4}
\]
where $M$ is a matrix chosen to compress the range of vibrational frequencies, $F(q) = -\nabla V(q)$, $M^{1/2}(M^{1/2})^T = M$, and $W_t$ is a vector of independent standard Wiener processes. Each step of the integrator consists of the following substeps:
  • B: $P_n' = P_{n-1} + \frac{1}{2}\Delta t\,F(Q_{n-1})$,
  • A: $Q_n' = Q_{n-1} + \frac{1}{2}\Delta t\,M^{-1}P_n'$,
  • O: $P_n'' = \exp(-\gamma\Delta t)\,P_n' + \sqrt{1 - \exp(-2\gamma\Delta t)}\;\beta^{-1/2}M^{1/2}R_n$,
  • A: $Q_n = Q_n' + \frac{1}{2}\Delta t\,M^{-1}P_n''$,
  • B: $P_n = P_n'' + \frac{1}{2}\Delta t\,F(Q_n)$,
where $R_n$ is a vector of independent standard Gaussian random variables. The samples generated from this process are shown [7] to be those from a distribution that differs from the correct one by only $O(\Delta t^2)$. The special choice $\gamma = 1/(2\Delta t)$ is the Euler–Leimkuhler–Matthews integrator [6] for Brownian dynamics with step size $\delta t = \Delta t^2/2$. Remarkably, the invariant density of this integrator differs from the correct one by only $O(\delta t^2)$. This integrator can be expressed as a Markov chain $Q_1 \to Q_2 \to \cdots \to Q_{N-1}$ in configuration space, with propagator
  • $Q_n' = Q_n + \frac{1}{2}\sqrt{2\delta t}\;\beta^{-1/2}\big(M^{-1/2}\big)^T R_n$,
  • $Q_{n+1} = Q_n' + \delta t\,M^{-1}F(Q_n') + \frac{1}{2}\sqrt{2\delta t}\;\beta^{-1/2}\big(M^{-1/2}\big)^T R_n$,
which is a discretization of Brownian dynamics
\[
dQ_t = M^{-1}F(Q_t)\,dt + \sqrt{2/\beta}\;\big(M^{-1/2}\big)^T\,dW_t. \tag{5}
\]
The desired samples $\{Q_n'\}$ are available as part of the process (and, as a theoretical observation, they can be recovered from the Markov chain $\{Q_n\}$ alone, by eliminating $R_n$ in the two equations above and solving for $Q_n'$).
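The following sketch (illustrative code, not from the paper) implements one BAOAB step for a scalar harmonic potential $V(q) = \frac{1}{2}\omega^2q^2$ with $M = 1$; the parameter values are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
omega, beta, gamma, dt = 1.0, 1.0, 1.0, 0.5   # assumed illustrative values

def force(q):
    # F(q) = -V'(q) for V(q) = 0.5*omega^2*q^2
    return -omega**2 * q

def baoab_step(q, p):
    p = p + 0.5 * dt * force(q)               # B: half kick
    q = q + 0.5 * dt * p                      # A: half drift (M = 1)
    c = np.exp(-gamma * dt)                   # O: exact OU update of momentum
    p = c * p + np.sqrt((1 - c**2) / beta) * rng.standard_normal()
    q = q + 0.5 * dt * p                      # A: half drift
    p = p + 0.5 * dt * force(q)               # B: half kick
    return q, p

q, p = 0.0, 0.0
qs = np.empty(200_000)
for n in range(qs.size):
    q, p = baoab_step(q, p)
    qs[n] = q
# the sampled variance of q should be near 1/(beta*omega^2) = 1, up to O(dt^2)
print("Var[q] =", qs.var())
```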
For any MCMC propagator, the forward transfer operator is defined so that
\[
u_n = \mathcal{F}u_{n-1},
\]
where $u_n(z) = \rho_n(z)/\rho(z)$ and $\rho_n(z)$ is the probability density for $Z_n$. In particular,
\[
\mathcal{F}u_{n-1}(z) = \frac{1}{\rho(z)}\int \rho(z\,|\,z')\,u_{n-1}(z')\,\rho(z')\,dz',
\]
where $\rho(z\,|\,z')$ is the transition probability for the chain. The operator $\mathcal{F}$ has an eigenfunction $\varphi_1(z) \equiv 1$ for eigenvalue $\lambda_1 = 1$.
A reversible propagator is one that satisfies detailed balance, which means that
\[
\rho(z'\,|\,z)\,\rho(z) = \rho(z\,|\,z')\,\rho(z').
\]
Detailed balance is equivalent to $\mathcal{F}^\dagger = \mathcal{F}$, where the adjoint $\mathcal{F}^\dagger$ is defined by the condition that $\langle \mathcal{F}^\dagger v, u\rangle = \langle v, \mathcal{F}u\rangle$ for arbitrary $u(z)$ and $v(z)$. The BAOAB integrator is not reversible as a sampling propagator, except for the special case $\gamma = 1/(2\Delta t)$; however, a scheme consisting of a fixed number of BAOAB steps followed by a momenta flip is reversible. The unmodified BAOAB integrator does, however, belong to the class of “quasi-reversible” integrators introduced in Section 3.1.
As a model problem for Brownian dynamics, given by Equation (5), consider $F(q) = -m\omega^2 q$. (The absence of boldface indicates scalar quantities.) Changing variables $Q_t' = (m\beta)^{1/2}Q_t$ and dropping the prime gives the simple stochastic differential equation
\[
dQ_t = -\omega^2 Q_t\,dt + \sqrt{2}\,dW_t, \tag{6}
\]
where $W_t$ denotes a standard Wiener process. A perfect realization of $Q_t$ at discrete points is obtained by the reversible propagator
\[
Q_n = \exp(-\omega^2\delta t)\,Q_{n-1} + \sqrt{1 - \exp(-2\omega^2\delta t)}\;\frac{1}{\omega}R_n. \tag{7}
\]
The probability density $\rho(q,t)$ for $Q_t$ satisfies the Fokker–Planck equation $(\partial/\partial t)\rho = (\partial/\partial q)(\omega^2 q\,\rho) + (\partial^2/\partial q^2)\rho$. Writing $\rho(q,t) = u(q,t)\,\rho(q)$ gives $(\partial/\partial t)u = -\omega^2 q\,(\partial/\partial q)u + (\partial^2/\partial q^2)u$. The operator on the right-hand side has eigenfunctions $u(q) = \mathrm{He}_k(\omega q)$ with eigenvalues $-k\omega^2$ for $k = 0, 1, \ldots$. The modified version of the Hermite polynomial of degree $k$ is given by $\mathrm{He}_k(x) = (-1)^k\exp(x^2/2)\,(d^k/dx^k)\exp(-x^2/2)$. The forward transfer operator $\mathcal{F}$ for the propagator defined by Equation (7) has these same eigenfunctions and has eigenvalues $\lambda_{k+1} = \exp(-k\omega^2\delta t)$ for $k = 1, 2, \ldots$. The spectral gap is $1 - \exp(-\omega^2\delta t)$. In the multidimensional case with normal mode frequencies $0 < \omega_1 \le \omega_2 \le \cdots \le \omega_\nu$, the time step $\delta t$ is some fraction, say $\frac{1}{2}$, of $1/\omega_\nu^2$, and the spectral gap is very nearly $\frac{1}{2}\,\omega_1^2/\omega_\nu^2$. To see the applicability to practical numerical integrators, consider the Euler–Leimkuhler–Matthews discretization of Equation (6):
\[
Q_n = (1 - \omega^2\delta t)\,Q_{n-1} + \sqrt{\big(1 - \tfrac{1}{2}\omega^2\delta t\big)\,2\omega^2\delta t}\;\frac{1}{\omega}R_n. \tag{8}
\]
Comparing Equations (8) and (7) and equating the coefficients of the two terms on their right-hand sides enables Equation (8) to be written in the form of Equation (7) with modified values for $\omega$ and $\delta t$. In particular, the modified value of $\exp(-\omega^2\delta t)$ is $1 - \omega^2\delta t$, so the spectral gap is exactly $\omega^2\delta t$. In the multidimensional case, where $\omega_1^2\delta t \ll 1$, the spectral gap for the discrete stochastics is very nearly that of the continuous stochastics.
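As a quick check of this coefficient matching (an illustration under assumed values of $\omega$ and $\delta t$), the decay factor and noise variance of Equation (8) coincide with those of Equation (7) once the modified value of $\exp(-\omega^2\delta t)$ is taken to be $1 - \omega^2\delta t$:

```python
import numpy as np

omega, dt = 1.0, 0.25                  # assumed values with omega^2*dt < 1

a = 1 - omega**2 * dt                  # modified value of exp(-omega^2*dt)
dt_mod = -np.log(a) / omega**2         # modified step size (omega is unchanged)

var7 = (1 - a**2) / omega**2           # noise variance of propagator (7)
var8 = (1 - 0.5 * omega**2 * dt) * 2 * dt   # noise variance of propagator (8)
print(var7, var8)                      # both equal dt*(2 - omega^2*dt)
print("spectral gap:", 1 - a)          # exactly omega^2*dt
```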

2.1. Estimating Integrated Autocorrelation Time

Suggested [8] as a covariance estimate is the quantity
\[
C_N(k) = \frac{1}{N}\sum_{n=0}^{N-k-1}\big(u(z_n) - \bar U_N\big)\big(u(z_{n+k}) - \bar U_N\big),
\]
with justification in Ref. [9] (pp. 323–324).
The use of all possible terms $C_N(k)/C_N(0)$ in Equation (3) to estimate the IAcT does not converge in the limit $N \to +\infty$, so, in practice, a lag window $w(k)$ is used to increasingly damp terms as the noise-to-signal ratio increases:
\[
\tau \approx 1 + 2\sum_{k=1}^{N-1} w(k)\,\frac{C_N(k)}{C_N(0)}.
\]
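A minimal sketch of such an estimator follows (illustrative; the simple truncation window used below, $w(k) = 1$ for $k \le k_{\max}$ and 0 otherwise, is an assumption for the demonstration and not the window used by acor [10]).

```python
import numpy as np

def iact(x, kmax):
    """Estimate the IAcT of a scalar time series x with a truncation window."""
    x = np.asarray(x, dtype=float)
    N = x.size
    xc = x - x.mean()
    c0 = xc @ xc / N                      # C_N(0)
    tau = 1.0
    for k in range(1, min(kmax, N - 1) + 1):
        ck = xc[:-k] @ xc[k:] / N         # C_N(k) as defined above
        tau += 2.0 * ck / c0              # truncation lag window
    return tau

# usage on an AR(1) series whose exact IAcT is (1+a)/(1-a) = 19 for a = 0.9
rng = np.random.default_rng(2)
a, N = 0.9, 200_000
x = np.empty(N); x[0] = rng.standard_normal()
for n in range(1, N):
    x[n] = a * x[n-1] + np.sqrt(1 - a**2) * rng.standard_normal()
print(iact(x, kmax=200))
```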
An interesting algorithm called acor for estimating the IAcT is available on the web [10]. Estimating the IAcT can be quite difficult, and acor can give unsatisfactory results. An attempt to improve it [1] is at best marginally successful. For reversible methods, there are properties of the autocorrelation function that may be useful for improving estimates of it [8].

3. Quasi-Reliable Estimates of Sampling Thoroughness

The idea of quasi-reliability is to require that the sample size $N$ be large enough that, with, say, 95% confidence, the estimated probability of any subset $\Omega$ of phase space differs by no more than $\mathrm{tol}$ from its correct value. More specifically, for any subset $\Omega$ of phase space, an estimate $\overline{1_\Omega}_N$ of $E[1_\Omega(Z)] = \Pr(Z \in \Omega)$ must satisfy
\[
\operatorname{Var}\big[\overline{1_\Omega}_N\big] \le \tfrac{1}{4}\,\mathrm{tol}^2.
\]
Because
\[
\operatorname{Var}\big[\overline{1_\Omega}_N\big] \approx \tau_\Omega\,\frac{1}{N}\operatorname{Var}[1_\Omega(Z)] \le \frac{1}{4N}\,\tau_\Omega,
\]
where $\tau_\Omega$ is the IAcT for $1_\Omega$, it is enough that
\[
\frac{1}{4N}\,\tau_\Omega \le \tfrac{1}{4}\,\mathrm{tol}^2 \quad \text{for all } \Omega.
\]
For molecular simulations, this requires sampling only those conformations or clusters of conformations having a probability greater than $\mathrm{tol}$.
In practice, molecular systems have many symmetries, which dramatically reduces the amount of sampling needed. For example, water molecules are generally considered interchangeable, as are many subsets of atoms on a given molecule, e.g., the 2 hydrogen atoms of any water molecule. More formally, there are permutations $P$ of the variables $z$ such that $\rho(Pz) = \rho(z)$ and $P^{-1}\mathcal{F}P = \mathcal{F}$, where $(Pu)(z) = u(Pz)$. For such symmetries $P$, the quasi-reliability requirement considers only those $\Omega$ for which $1_\Omega(Pz) = 1_\Omega(z)$.
It is helpful to express the IAcT in terms of the forward transfer operator. It can be shown that
\[
E[v(Z_0)\,u(Z_k)] = \langle \mathcal{F}^k v, u\rangle,
\]
and, in particular,
\[
C(k) = E[u(Z_0)u(Z_k)] - E[u(Z_0)]\,E[u(Z_k)] = \langle \mathcal{F}^k u, u\rangle - \langle \mathcal{E}u, u\rangle =
\begin{cases}
\langle (1-\mathcal{E})u, u\rangle, & \text{for } k = 0,\\[2pt]
\langle \mathcal{F}_0^k u, u\rangle, & \text{for } k \ge 1,
\end{cases} \tag{10}
\]
using the fact that $\mathcal{E}\mathcal{F}_0 = \mathcal{F}_0\mathcal{E} = 0$. The integrated autocorrelation time can be rewritten as
\[
\tau = \frac{C(0) + 2\sum_{k=1}^{+\infty} C(k)}{C(0)}
= \frac{\Big\langle \Big(1 - \mathcal{E} + 2\sum_{k=1}^{+\infty} \mathcal{F}_0^k\Big)u, u\Big\rangle}{\langle (1-\mathcal{E})u, u\rangle}
= \frac{\langle \mathcal{G}u, u\rangle}{\langle (1-\mathcal{E})u, u\rangle},
\]
where
\[
\mathcal{G} = 1 - \mathcal{E} + \mathcal{F}_0(1-\mathcal{F}_0)^{-1} + \big(\mathcal{F}_0(1-\mathcal{F}_0)^{-1}\big)^\dagger,
\]
which is a self-adjoint operator.
For a reversible propagator, for which $\mathcal{F}$ and hence $\mathcal{F}_0$ is self-adjoint, an arbitrary preobservable $u$ is in many cases expressible as a linear combination of the eigenfunctions $\varphi_k(z)$ of $\mathcal{F}_0$, corresponding to eigenvalues $1 > \lambda_2 \ge \lambda_3 \ge \cdots > -1$. (For a more rigorous treatment, see Ref. [8] (Section 2).) The IAcT for $u$ is then simply a weighted average of the values
\[
1 + \frac{2\lambda_k}{1-\lambda_k} = \frac{1+\lambda_k}{1-\lambda_k}
\]
with weights $\langle \varphi_k, u\rangle^2/(\langle \varphi_k, \varphi_k\rangle\,\langle u, u\rangle)$. This is maximized for $u = \varphi_2$, since $(1+\lambda_2)/(1-\lambda_2)$ is the largest of these values. Note that, for $\lambda_2$ negative, $\tau$ could be much less than 1. Having $\tau < 1$ may appear paradoxical until it is recognized that Equation (1) is simply an asymptotic approximation for $N \to +\infty$.
For the simple example with $F(q) = -\omega^2 q$, the eigenfunction $\varphi_2(q) = \sqrt{2}\,\omega q$. The indicator function $1_\Omega$ that is richest in $\varphi_2(q)$ is the one with $\Omega = [0, +\infty)$, for which the first weight is $2/\pi$. This means that the IAcT for $1_\Omega$ is at least $2/\pi$ of the maximum IAcT. For a multimodal distribution, the eigenfunction $\varphi_2(q)$ corresponding to the subdominant eigenvalue $\lambda_2$ [2,3] is closer than is the function $\sqrt{2}\,\omega q$ to an indicator function modulo a constant. Therefore, little is lost, and simplicity is gained, if we use the maximum IAcT over all observables satisfying the symmetries instead of just indicator functions:
\[
\tau_{\max} = \sup_{u \in W} \frac{\langle \mathcal{G}u, u\rangle}{\langle (1-\mathcal{E})u, u\rangle},
\]
where
\[
W = \{\,u = u(z)\;|\;u(Pz) = u(z)\ \text{for symmetries } P\,\}
\]
is our set of preobservables. This can be simplified to
\[
\tau_{\max} = \sup_{u \in W} \frac{\langle \mathcal{G}u, u\rangle}{\langle u, u\rangle}.
\]
To see this, note that the same supremum is obtained for both objective functions if the function $u$ is restricted so that $\mathcal{E}u = 0$ and that for such a function $u$ the two objective functions are equal.
For the simple example of Equation (8), the IAcT is maximized by $u(q) = q$. In the case of a multidimensional Gaussian distribution, the maximum occurs for a linear combination $a^Tq$ of $q$, where $a$ is the eigenvector of the Hessian of $V(q)$ corresponding to its smallest eigenvalue $\omega_1^2$. For multimodal distributions in one dimension, it appears that the maximizing preobservable $\varphi_2(q)$ is qualitatively similar to $q$ in the sense that it is monotonic [2,3].
As is customary when seeking an unknown function, one considers a finite linear combination $u(q) = a^T\mathbf{u}(q)$ of basis functions $u_i \in W$, where the $a_i$ are chosen to maximize the IAcT. The number of basis functions is limited by computational cost. The result of Section 5 hints that the ideal number might be $O(\nu^2)$. From Equation (10), the autocovariance for such $u(q)$ is
\[
C(k) = a^T C_k a,
\]
where
\[
C_k = \begin{cases} \langle (1-\mathcal{E})\mathbf{u}, \mathbf{u}^T\rangle, & \text{for } k = 0,\\[2pt] \langle \mathcal{F}_0^k\mathbf{u}, \mathbf{u}^T\rangle, & \text{for } k \ge 1. \end{cases} \tag{13}
\]
Consequently,
\[
\tau_{\max} \approx \max_a \frac{a^T K a}{a^T C_0 a}, \quad\text{where } K = C_0 + 2\sum_{k=1}^{+\infty} C_k, \tag{14}
\]
which is a solution of the generalized eigenvalue problem
\[
\tfrac{1}{2}(K + K^T)\,a = C_0\,a\,\tau. \tag{15}
\]
The approximation of Equation (14) improves with the number of basis functions.
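A sketch of this computation follows (illustrative code; the trajectory of basis-function values below is a hypothetical stand-in for output from an actual propagator): estimate $C_0$ and the $C_k$ from the time series, form $\frac{1}{2}(K + K^T)$, and solve the generalized eigenvalue problem of Equation (15).

```python
import numpy as np
from scipy.linalg import eigh

def tau_max_estimate(U, kmax):
    """U: (N, m) array; row n holds the m basis functions evaluated at z_n.
    Returns the estimated maximum IAcT over linear combinations a^T u
    and the maximizing coefficient vector a."""
    U = U - U.mean(axis=0)             # center, so that E u = 0 empirically
    N = U.shape[0]
    C0 = U.T @ U / N
    K = C0.copy()
    for k in range(1, kmax + 1):
        Ck = U[:-k].T @ U[k:] / N      # empirical C_k, truncating the sum
        K += 2.0 * Ck
    tau, A = eigh(0.5 * (K + K.T), C0) # generalized symmetric eigenproblem
    return tau[-1], A[:, -1]           # largest eigenvalue and its vector a

# hypothetical usage: two independent AR(1) series standing in for basis data,
# for which the IAcT of any linear combination is (1+a)/(1-a) = 9
rng = np.random.default_rng(3)
N, a = 100_000, 0.8
U = np.empty((N, 2)); U[0] = rng.standard_normal(2)
for n in range(1, N):
    U[n] = a * U[n-1] + np.sqrt(1 - a**2) * rng.standard_normal(2)
print(tau_max_estimate(U, kmax=100)[0])
```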
Without information about the actual distribution $\rho(z)$, a general choice for basis functions might be linear polynomials, which are the “subdominant” eigenfunctions for a Gaussian distribution. Simple examples in Ref. [1] (consisting of a mixture of two Gaussians, a one-node neural “network”, and logistic regression) demonstrate that the use of linear basis functions can yield an IAcT much greater than that of a preobservable of “interest”. For molecular simulation, it is clear that, instead of atomic coordinates, a better choice of a basis function is the distance between two atoms, each of which is uniquely distinguishable. In particular, for a protein, one might choose $\alpha$-carbons distributed along the backbone chain. For further suggestions, consult Ref. [11], which considers the automatic construction of indicator functions for estimating $\tau_{\max}$, based on the dynamics of the propagator.

3.1. Quasi-Reversible Propagators

As stated above, the BAOAB integrator would be reversible if the final B substep were modified to include a momenta flip, i.e.,
  • BR: $P_n = -\big(P_n'' + \frac{1}{2}\Delta t\,F(Q_n)\big)$.
However, reversing direction runs counter to intuition. Additionally, empirical evidence [12] favors irreversible propagators. If the flipped BAOAB integrator is followed by another momenta flip, the result is the original irreversible BAOAB integrator. More generally, a sampler might be said to be “quasi-reversible” if its forward transfer operator $\mathcal{F} = \mathcal{R}\bar{\mathcal{F}}$, where each of $\bar{\mathcal{F}}$ and $\mathcal{R}$ is reversible and $\mathcal{R}^2 = 1$. For a momenta flip, the operator $\mathcal{R}$ is defined by $(\mathcal{R}u)(q,p) = u(q,-p)$.
Quasi-reversible propagators are special in that their covariance matrices have additional structure if basis functions $u_i$ are chosen so that
\[
\mathcal{R}\mathbf{u} = D\mathbf{u}, \quad\text{where } D = \operatorname{diag}(I, -I)
\]
and the two identity matrices $I$ are possibly of different dimensions. Such basis functions are easy to construct, since any function $u$ can be written as a sum of its “even part” $\frac{1}{2}(u + \mathcal{R}u)$ and its “odd part” $\frac{1}{2}(u - \mathcal{R}u)$. In particular, using the fact that $\mathcal{R}\mathcal{E} = \mathcal{E}\mathcal{R}$,
\[
\begin{aligned}
C_k &= \langle \mathcal{F}^k\mathbf{u}, \mathbf{u}^T\rangle - \langle \mathcal{E}\mathbf{u}, \mathbf{u}^T\rangle
= \langle \mathbf{u}, (\bar{\mathcal{F}}\mathcal{R})^k\mathbf{u}^T\rangle - \langle \mathbf{u}, \mathcal{E}\mathbf{u}^T\rangle
= \langle \mathcal{R}\mathbf{u}, (\mathcal{R}\bar{\mathcal{F}})^k\mathcal{R}\mathbf{u}^T\rangle - \langle \mathcal{R}\mathbf{u}, \mathcal{R}\mathcal{E}\mathbf{u}^T\rangle\\
&= D\,\langle \mathbf{u}, \mathcal{F}^k\mathbf{u}^T\rangle\,D - D\,\langle \mathbf{u}, \mathcal{E}\mathbf{u}^T\rangle\,D
= D\,C_k^T\,D.
\end{aligned}
\]
The matrix $C_k$ thus partitions into four blocks. The symmetric part of $C_k$ consists of the two diagonal blocks, and the skew-symmetric part consists of the two off-diagonal blocks. Empirical estimates of $C_k$ lack these symmetry properties. The expected symmetries provide twice as many samples for estimating the sampling error in the off-diagonal elements. Additionally, since the IAcT requires only the symmetric part of $C_k$, it is unnecessary to compute the off-diagonal blocks, and the eigenvalue problem decouples into two smaller problems. It also follows that the maximizing linear combination is a linear combination either of the even functions or of the odd functions. For molecular dynamics at least, velocities “relax” much more quickly than positions, so the long time scales are likely to be present only in the position coordinates, in which case only the $(1,1)$ block need be computed. An example of exploiting this structure is sketched below.
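For instance (a sketch under the assumptions above, with the $n_e$ even basis functions listed first), an empirical $C_k$ can be replaced by the average of $C_k$ and $DC_k^TD$, which enforces the expected block structure:

```python
import numpy as np

def enforce_quasi_reversible_structure(Ck, n_even):
    """Average an empirical C_k with D C_k^T D, where D = diag(I, -I) and the
    first n_even basis functions are even in p, the remainder odd in p."""
    d = np.ones(Ck.shape[0]); d[n_even:] = -1.0
    # d[:, None] * Ck.T * d[None, :] computes diag(d) @ Ck.T @ diag(d)
    return 0.5 * (Ck + d[:, None] * Ck.T * d[None, :])
```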

4. Irreversible Samplers and Their Superiority

Uniform sampling—if it could be used—converges like $O(1/N)$, which is superior to the $O(1/N^{1/2})$ convergence of random sampling. It would be desirable to combine them by promoting near-uniform sampling within layers of near-constant energy and using diffusion to move among energy layers.

4.1. A Very Simple Example

The following very simple example is a discrete analog of Example 2.8 of Ref. [5] and similar to an example of Ref. [13]. Assuming probabilities are represented as row vectors, let $F$ be an $n$ by $n$ probability transition matrix given by
\[
F_{ij} = \begin{cases} \theta, & \text{if } i = j,\\ 1 - \theta, & \text{if } i = j + 1 \bmod n,\\ 0, & \text{otherwise}, \end{cases}
\]
where $0 < \theta < 1$. The stationary probability vector is $(1/n)[1, 1, \ldots, 1]$. The inner product for two row vectors $u^T$, $v^T$ is $\langle v^T, u^T\rangle = v^T u$, and the adjoint of a transition matrix is its transpose. Since $F$ is a circulant matrix, both it and its adjoint have as eigenvectors $u_k^T = [1, \zeta^{k-1}, \ldots, (\zeta^{k-1})^{n-1}]$, where $\zeta$ denotes the $n$th root of unity $\exp(2\pi i/n)$. A straightforward, though lengthy, calculation shows that
\[
u_k^T F_0 = \big(\theta + (1-\theta)\zeta^{k-1}\big)\,u_k^T \quad\text{and}\quad u_k^T F_0^T = \big(\theta + (1-\theta)\zeta^{1-k}\big)\,u_k^T,
\]
for $k = 2, 3, \ldots, n$, and
\[
u_k^T G = \frac{\theta}{1-\theta}\,u_k^T.
\]
Therefore,
\[
2/|1-\lambda_2| = 2/\big((1-\theta)\,|1-\zeta|\big) = 1/\big((1-\theta)\sin(\pi/n)\big),
\]
but
\[
\tau_{\max} = \theta/(1-\theta).
\]
This is a most elementary illustration of the assertion [5] that “the asymptotic variance for every observable can be made as small as desired under large irreversible drift”. If $\theta \ll 1$, the IAcT is likewise much less than 1, despite the very strong correlation between successive values of the state $j$. The reason for this is that the IAcT is a measure of sample quality, not independence. The measure $\tau$ simply says that $N$ samples are as good as $N/\tau$ independent random samples. The $F$ operator encourages a uniform sampling, which is better than uncorrelated random samples. However, as shown in the example that follows, such a dramatic difference between (one less than twice) the reciprocal of the spectral gap and the IAcT does not hold when sampling is inhibited more by energy than by entropy barriers. A reversible propagator can be formed from this simple example by using instead the symmetric part of its propagator: $F^{\mathrm{rv}} = \frac{1}{2}(F + F^T)$. This time,
\[
u_k^T F_0^{\mathrm{rv}} = \Big(\theta + \tfrac{1}{2}(1-\theta)\big(\zeta^{k-1} + \zeta^{1-k}\big)\Big)\,u_k^T, \quad \text{for } k = 2, 3, \ldots, n,
\]
and
\[
u_k^T G^{\mathrm{rv}} = \left(\frac{4}{(1-\theta)\big(2 - \zeta^{k-1} - \zeta^{1-k}\big)} - 1\right) u_k^T.
\]
Therefore,
\[
2/|1-\lambda_2| = 4/\big((1-\theta)(2 - \zeta - \zeta^{-1})\big) = 1/\big((1-\theta)\sin^2(\pi/n)\big),
\]
and
\[
\tau_{\max} = 1/\big((1-\theta)\sin^2(\pi/n)\big) - 1.
\]
Clearly, the irreversible propagator is much better for $n \gg 1$.
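Both results are easy to confirm numerically (an illustrative check, not from the paper; the adjoint is the transpose here because the stationary vector is uniform):

```python
import numpy as np

def tau_max(F):
    """Maximum IAcT over all preobservables, for a doubly stochastic F."""
    n = F.shape[0]
    E = np.ones((n, n)) / n                    # averaging operator
    F0 = F - E
    B = F0 @ np.linalg.inv(np.eye(n) - F0)
    G = np.eye(n) - E + B + B.T                # Equation (2); adjoint = transpose
    return np.linalg.eigvalsh(0.5 * (G + G.T)).max()

n, theta = 16, 0.5
F = theta * np.eye(n) + (1 - theta) * np.roll(np.eye(n), 1, axis=0)
print(tau_max(F), theta / (1 - theta))                             # irreversible
Frv = 0.5 * (F + F.T)
print(tau_max(Frv), 1 / ((1 - theta) * np.sin(np.pi / n)**2) - 1)  # reversible
```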

4.2. A Very Simple Example with a Barrier

Consider the $2n$ by $2n$ transition matrix given by
\[
F = \begin{bmatrix}
0 & 1-\varepsilon & & & & \varepsilon & & \\
& 0 & 1 & & & & & \\
& & \ddots & \ddots & & & & \\
1 & & & 0 & & & & \\
& \varepsilon & & & 0 & 1-\varepsilon & & \\
& & & & & 0 & 1 & \\
& & & & & & \ddots & \ddots \\
& & & & 1 & & & 0
\end{bmatrix},
\]
in which each diagonal block is $n$ by $n$.
If $\varepsilon$ were zero, the eigenvalue 1 would have multiplicity 2, with orthogonal eigenvectors given by the vector of all ones and by $f^T = [1, 1, \ldots, 1, -1, -1, \ldots, -1]$. For small $\varepsilon > 0$, it is expected that the eigenvector corresponding to the subdominant eigenvalue would be a perturbation of $f^T$. Indeed, it happens that $f^T F^n = (1 - 2\varepsilon)f^T$. This also holds for $(F^n)^T$, so the IAcT for $F^n$ is $1/\varepsilon - 1$. This suggests a value of about $n(1/\varepsilon - 1)$ for the IAcT of $F$. This is corroborated by numerical computation, as shown in Table 1a.
The excess IAcT due to making the propagator reversible is given in Table 1b. It appears to grow quadratically with $n$, consistent with the diffusive nature of the propagator. Summing pairs of entries from the two tables shows that, for this example, the advantage of irreversibility depends on the relative importance of entropy barriers to energy barriers, which is consistent with intuition from molecular dynamics.
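The sketch below rebuilds this chain (illustrative code; the placement of the $\varepsilon$-entries follows the matrix as reconstructed above) and computes $\tau_{\max}$ as in the previous listing; it should approximately reproduce the entries of Table 1 for small $n$.

```python
import numpy as np

def tau_max(F):
    n = F.shape[0]
    E = np.ones((n, n)) / n
    F0 = F - E
    B = F0 @ np.linalg.inv(np.eye(n) - F0)
    G = np.eye(n) - E + B + B.T
    return np.linalg.eigvalsh(0.5 * (G + G.T)).max()

def barrier_chain(n, eps):
    """Two n-cycles; leaving state 1 or state n+1 crosses halves w.p. eps."""
    F = np.zeros((2 * n, 2 * n))
    F[0, 1], F[0, n + 1] = 1 - eps, eps        # junction in the first half
    for i in range(1, n - 1):
        F[i, i + 1] = 1.0                      # deterministic advance
    F[n - 1, 0] = 1.0                          # wrap within the first half
    F[n, n + 1], F[n, 1] = 1 - eps, eps        # junction in the second half
    for i in range(n + 1, 2 * n - 1):
        F[i, i + 1] = 1.0
    F[2 * n - 1, n] = 1.0                      # wrap within the second half
    return F

F = barrier_chain(10, 0.1)
print(tau_max(F))                              # about n*(1/eps - 1) = 90
print(tau_max(0.5 * (F + F.T)) - tau_max(F))   # excess for the reversible form
```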

5. Optimal Langevin Damping for a Model Problem

To analyze the effect of the damping parameter $\gamma$ on the IAcT for Langevin dynamics, given by Equation (4), consider the standard model problem $F(q) = -m\omega^2 q$. Changing variables $Q_t' = (\beta m)^{1/2}Q_t$, $P_t' = \beta^{1/2}m^{-1/2}P_t$, and dropping the primes gives
\[
dQ_t = P_t\,dt, \qquad dP_t = -\omega^2 Q_t\,dt - \gamma P_t\,dt + \sqrt{2\gamma}\,dW_t.
\]
Assume exact integration with step size $\Delta t$.
The analysis is rather lengthy, so for the benefit of the reader who wishes to omit it, the discussion and conclusions are given here:
  • Reaching precise conclusions is difficult for most of the analysis unless $\omega\Delta t$ is bounded above by some constant of order 1, which seems to be satisfied in practice. This assumption underlies the statements that follow.
  • The spectral gap is an increasing function of $\omega$, so for a multidimensional quadratic potential energy, the value of $\gamma$ that maximizes the spectral gap depends on the lowest frequency $\omega_1$.
  • The spectral gap is maximized for $\gamma \le 2\omega$, corresponding to an underdamped system, for which the spectral gap is $\omega\Delta t + O(\Delta t^2)$.
  • The eigenfunctions of the operator $\mathcal{G}$ can be partitioned into eigenspaces $\mathcal{P}_k$, $k = 0, 1, \ldots$, where each element of $\mathcal{P}_k$ is a linear combination of $k + 1$ specific polynomials of degree $k$ in $\omega q$ and $p$. The greatest eigenvalue of $\mathcal{G}$ is $\tau_{\max} = \max_k \tau_{\max}(k)$, where $\tau_{\max}(k)$ is the maximum IAcT over $\mathcal{P}_k$.
  • Figure 1 shows that, for fixed $\Delta t$ and $\gamma$, the value $\tau_{\max}(k)$ is a decreasing function of $\omega$, at least for $k = 1, 2, 3, 4$. Hence, as for the spectral gap, it is the lowest frequency $\omega_1$ that dictates the maximum $\tau$.
  • Figure 2 indicates that, for fixed $\Delta t$ and $\omega$, the maximizing $\tau$ is either $\tau_{\max}(1)$ or $\tau_{\max}(2)$, depending on the value of $\gamma$. The optimal damping coefficient is $\gamma = (\sqrt{6}/2)\,\omega$, for which $\tau_{\max}(1) = \tau_{\max}(2) = \sqrt{6}/(\omega\Delta t)$.
  • For the preobservable $\omega q$, and, indeed, for any odd polynomial, the IAcT $\tau$ becomes arbitrarily small as $\gamma$ goes to zero. This does not mean, however, that the variance goes to 0, because Equation (1) is an asymptotic result, and the IAcT is a prefactor for the leading-order $1/N$ term in the expression for the variance. The next-order $1/N^2$ term would dominate if $\gamma$ were very small. An order $1/N^2$ error is characteristic of uniform sampling, which would be the consequence of nearly ballistic movement. In addition, odd polynomials are special in that their expectation is independent of total energy, so it matters not that barely diffusive movement samples different values of total energy only at a very slow rate.
  • The eigenfunction for $\tau_{\max}(2)$ is a specific linear combination of $\omega^2q^2 - 1$ and $p^2 - 1$. For quadratic polynomials, the total energy does affect the expected value, which is why it is necessary that $\gamma$ be large enough to sample different energies on a reasonable time scale.

5.1. The Forward Transfer Operator

The forward transfer operator is
\[
\mathcal{F}u = \rho^{-1}e^{\Delta t\,L_K}(\rho u),
\]
where
\[
\rho(q,p) \propto \exp\!\left(-\tfrac{1}{2}p^2 - \tfrac{1}{2}\omega^2 q^2\right)
\]
and $L_K$ is the Fokker–Planck operator [14] (Equations (10.1)–(10.3), (10.9))
\[
L_K f = -p\,\partial_q f + \omega^2 q\,\partial_p f + \gamma\,\partial_p(p f) + \gamma\,\partial_p^2 f.
\]
Therefore,
\[
\mathcal{F} = e^{\Delta t\,L},
\]
where
\[
L f = \frac{1}{\rho}\,L_K(\rho f) = -p\,\partial_q f + \omega^2 q\,\partial_p f - \gamma p\,\partial_p f + \gamma\,\partial_p^2 f.
\]
Using the relation $Lf = \frac{1}{\rho}L_K(\rho f)$, it is easy to show that the adjoint
\[
L^\dagger f = p\,\partial_q f - \omega^2 q\,\partial_p f - \gamma p\,\partial_p f + \gamma\,\partial_p^2 f.
\]

5.2. Gamma for Maximum Spectral Gap

Ref. [14] (Chapter 10, Equations (1), (2), (3b), (9), (22), (52), (60), (71), (72), (74), (77), (78), (82) and (83)) gives the eigenvalues of $L_K$ as
\[
\mu_{k,l} = -\tfrac{1}{2}(k+l)\gamma - \tfrac{1}{2}(k-l)\delta, \qquad k, l = 0, 1, \ldots,
\]
with eigenfunctions
\[
\psi_{k,l} = \rho(q,p)^{1/2}\,\big(k!\,l!\,\delta^{k+l}\big)^{-1/2}\,\big(\sqrt{\gamma_+}\,B - \sqrt{\gamma_-}\,A\big)^k \big(\sqrt{\gamma_-}\,B + \sqrt{\gamma_+}\,A\big)^l\,\rho(q,p)^{1/2},
\]
where
\[
\gamma_\pm = \tfrac{1}{2}(\gamma \pm \delta) \quad\text{and}\quad \delta = \sqrt{\gamma^2 - 4\omega^2},
\]
with the radical sign denoting the principal square root, and operators $A$, $B$ defined by
\[
A f = -\omega^{-1}\,\partial_q f + \tfrac{1}{2}\omega q f, \qquad B f = -\partial_p f + \tfrac{1}{2} p f.
\]
The operator $L$ has eigenfunctions $\rho(q,p)^{-1}\psi_{k,l}(q,p)$ and the same eigenvalues; $\mathcal{F}$ has these same eigenfunctions and has eigenvalues $\exp(\Delta t\,\mu_{k,l})$. See Ref. [15] for a more general result.
For $k + l = 1$, one has $\mu_{0,1} = -\gamma_-$, $\mu_{1,0} = -\gamma_+$ and
\[
\varphi_{0,1} = (\gamma_-/\delta)^{1/2}\,(p + \gamma_+ q)\,\rho(q,p), \qquad
\varphi_{1,0} = (\gamma_+/\delta)^{1/2}\,(p - \gamma_- q)\,\rho(q,p).
\]
In the case $\gamma_\pm = \frac{1}{2}\gamma$ (i.e., $\gamma = 2\omega$), $\mathcal{F}$ has a double eigenvalue $\lambda_2 = \lambda_3 = e^{-\omega\Delta t}$ but only one eigenfunction $\varphi_2(q,p) = p - \omega q$ and one generalized eigenfunction $\varphi_3(q,p) = q$, satisfying $(\mathcal{F} - \lambda_2)\varphi_3 = \varphi_2$. This results in behavior involving a linear combination of $e^{-\omega t}$ and $te^{-\omega t}$.
The eigenvalue nearest 1 depends on $\Delta t$. For small enough $\Delta t$, it is $\exp(-\gamma_-\Delta t)$, and the spectral gap is $\omega\Delta t + O(\Delta t^2)$ for $\gamma \le 2\omega$ and
\[
\omega\Delta t\,\frac{2\omega/\gamma}{1 + \big(1 - (2\omega/\gamma)^2\big)^{1/2}} + O(\Delta t^2)
\]
otherwise. To maximize the spectral gap, choose $\gamma$ to be no greater than $2\omega$, corresponding to an underdamped system.

5.3. Gamma for Maximum IAcT

To obtain $\tau_{\max}$, one needs eigenelements for $\mathcal{G}$, given by Equation (2) with $\mathcal{F} = \exp(\Delta t\,L)$. Note that the subspace of polynomials of degree $k$ is closed under application of $L$ and $L^\dagger$. Moreover, this holds separately for the subspace $\mathcal{P}_k$ of odd polynomials of degree $k$ for $k$ odd and for the subspace $\mathcal{P}_k$ of even polynomials of degree $k$ for $k$ even. It also applies to the operators $\mathcal{E}$, $\mathcal{F}$, $\mathcal{F}^\dagger$, and $\mathcal{G}$. Hence, the $\mathcal{P}_k$ are eigenspaces for $\mathcal{G}$. These eigenspaces can be further decomposed as follows. Let $\mathcal{P}_k' = \mathcal{P}_k$ for $k < 2$, and let $\mathcal{P}_k' = \mathcal{P}_k \ominus \mathcal{P}_{k-2}$ for $k \ge 2$. The claim is that $\mathcal{P}_k'$ is an eigenspace, of dimension $k + 1$. To confirm this, it is enough to show that $\langle \mathcal{G}u, v\rangle = 0$ for any $u \in \mathcal{P}_k'$, $v \in \mathcal{P}_{k-2}$, which follows almost immediately, since $\mathcal{G} = \mathcal{G}^\dagger$ and $\mathcal{G}v \in \mathcal{P}_{k-2}$. Let $\mathbf{u}$ be a basis for $\mathcal{P}_k'$ chosen so that each element is either an even or an odd function of $p$. In particular,
\[
\begin{aligned}
\mathbf{u} &= [\,\omega q,\ p\,]^T,\\
\mathbf{u} &= [\,\omega^2 q^2 - 1,\ \omega q p,\ p^2 - 1\,]^T,\\
\mathbf{u} &= [\,\omega^3 q^3 - 3\omega q,\ (\omega^2 q^2 - 1)p,\ \omega q(p^2 - 1),\ p^3 - 3p\,]^T,\\
\mathbf{u} &= [\,\omega^4 q^4 - 6\omega^2 q^2 + 3,\ (\omega^3 q^3 - 3\omega q)p,\ (\omega^2 q^2 - 1)(p^2 - 1),\ \omega q(p^3 - 3p),\ p^4 - 6p^2 + 3\,]^T,
\end{aligned}
\]
for $k = 1, 2, 3, 4$, respectively. The polynomials in $\omega q$ and in $p$ are modified versions $\mathrm{He}_k$ of Hermite polynomials. We have $L\mathbf{u} = A\mathbf{u}$ for some matrix of constants $A$, given by
\[
A = \begin{bmatrix}
0 & -k\omega & & & \\
\omega & -\gamma & -(k-1)\omega & & \\
& 2\omega & -2\gamma & \ddots & \\
& & \ddots & \ddots & -\omega \\
& & & k\omega & -k\gamma
\end{bmatrix} \tag{16}
\]
(cf. Ref. [14], Equations (10.96) and (10.97) for $L_K$). We have $\mathcal{E}\mathbf{u} = 0$ and $\mathcal{F}_0\mathbf{u} = \mathcal{F}\mathbf{u} = \exp(\Delta t\,A)\,\mathbf{u}$. Therefore, from Equations (13) and (14), one has
\[
C_k = \langle \mathcal{F}_0^k\mathbf{u}, \mathbf{u}^T\rangle = \exp(k\Delta t\,A)\,\langle \mathbf{u}, \mathbf{u}^T\rangle = \exp(k\Delta t\,A)\,C_0,
\]
and
\[
K = -\coth\big(\tfrac{1}{2}\Delta t\,A\big)\,\langle \mathbf{u}, \mathbf{u}^T\rangle = -\coth\big(\tfrac{1}{2}\Delta t\,A\big)\,C_0.
\]
For each value of $k$, $\tau_{\max}$ is the maximum eigenvalue of
\[
-\tfrac{1}{2}\Big(C_0^{-1}\coth\big(\tfrac{1}{2}\Delta t\,A\big)\,C_0 + \coth\big(\tfrac{1}{2}\Delta t\,A\big)^T\Big). \tag{17}
\]
As explained in Section 3.1, if the basis functions are re-ordered so that those that are even functions of p precede those that are odd functions of p, the eigenvalue problem splits into two nearly equal parts.
For $k = 1$, $A = X\Lambda X^{-1}$, where $\Lambda = \operatorname{diag}(-\gamma_-, -\gamma_+)$ and
\[
X = \begin{bmatrix} \omega & \omega \\ \gamma_- & \gamma_+ \end{bmatrix}.
\]
The covariance matrix $C_0 = \langle \mathbf{u}, \mathbf{u}^T\rangle = \operatorname{diag}(1,1)$, and matrix (17) is $\operatorname{diag}(\tau_+, \tau_-)$, where
\[
\tau_\pm = \frac{1}{\delta}\Big(\gamma_+ \coth\big(\tfrac{1}{2}\Delta t\,\gamma_\mp\big) - \gamma_- \coth\big(\tfrac{1}{2}\Delta t\,\gamma_\pm\big)\Big).
\]
Using the identity $\coth(x \pm y) = (\sinh 2x \mp \sinh 2y)/(\cosh 2x - \cosh 2y)$, one gets that the eigenvalues of matrix (17) are
\[
\tau_\pm = \frac{\sinh\big(\tfrac{1}{2}\Delta t\,\gamma\big) \pm \gamma\,\sinh\big(\tfrac{1}{2}\Delta t\,\delta\big)/\delta}{\cosh\big(\tfrac{1}{2}\Delta t\,\gamma\big) - \cosh\big(\tfrac{1}{2}\Delta t\,\delta\big)}.
\]
The greater of $\tau_\pm$ is $\tau_+$, and it is a straightforward exercise to show that $\tau_+$ is a decreasing function of $\omega$ as long as
\[
(\omega\Delta t)^2 \le \pi^2 + (\gamma\Delta t/2)^2. \tag{18}
\]
In practice, a numerical integrator is used, whose step size is chosen so that $\Delta t\,L = \theta$, where $\theta$ is some fractional value, e.g., $\frac{1}{2}$, and $L$ is the magnitude of the largest eigenvalue of the Jacobian matrix of the right-hand side of the system of (first-order) stochastic differential equations. In particular,
\[
\omega\Delta t = \theta \quad \text{if } \gamma \le 2\omega; \tag{19}
\]
otherwise, it holds that $\omega\Delta t < \gamma\Delta t/2$. Hence, Inequality (18) is more than satisfied. It is more complicated to analyze the behavior of values of $\tau$ for $k > 1$, so we exploit the smallness of $\omega\Delta t$ and $\gamma\Delta t$ to do an asymptotic analysis. From Expression (17), one sees that for $\mathcal{P}_k'$, its $\tau_{\max}$ equals $\tau_{\max}(k) + O(\Delta t)$, where $\tau_{\max}(k)$ is the greatest eigenvalue of
\[
-\frac{1}{\Delta t}\big(C_0^{-1}A^{-1}C_0 + A^{-T}\big). \tag{20}
\]
For $k = 1$, one has $C_0 = \operatorname{diag}(1,1)$ and $\tau_{\max}(1) = 2\gamma/(\omega^2\Delta t)$, corresponding to the eigenfunction $\omega q$. For $k = 2$, one has
\[
A^{-1} = -\frac{1}{2\gamma\omega^2}\begin{bmatrix} \gamma^2 + \omega^2 & -2\gamma\omega & \omega^2 \\ \gamma\omega & 0 & 0 \\ \omega^2 & 0 & \omega^2 \end{bmatrix},
\]
$C_0 = \operatorname{diag}(2,1,2)$, and the largest eigenvalue of Expression (20) is
\[
\tau_{\max}(2) = \frac{1}{2\Delta t}\left(\frac{\gamma}{\omega^2} + \frac{2}{\gamma} + \sqrt{\frac{\gamma^2}{\omega^4} + \frac{4}{\gamma^2}}\right),
\]
corresponding to an eigenfunction that is some linear combination of $\omega^2q^2 - 1$ and $p^2 - 1$. Again, $\tau_{\max}(k)$ is a decreasing function of $\omega$ for small enough $\Delta t$. For general $k$, the size of the elements of $A$ in Equation (16) increases with $\omega$, which suggests that, for practical values of $\Delta t$, the largest eigenvalue of Expression (20) decreases with $\omega$. This can be confirmed numerically. To reduce the number of parameters, write $A = \omega B(\gamma/\omega)$, and rewrite Expression (20) as
\[
-\frac{1}{\omega\Delta t}\Big(C_0^{-1}B(\gamma/\omega)^{-1}C_0 + B(\gamma/\omega)^{-T}\Big). \tag{21}
\]
The value $\tau_{\max}(k)$ is the largest eigenvalue of Matrix (21), so $\gamma\Delta t\,\tau_{\max}(k)$ is a function only of the ratio $\omega/\gamma$. Figure 1 plots this quantity for $k = 1, 2, 3, 4$, confirming that $\tau_{\max}(k)$ is a decreasing function of $\omega$ for fixed values of $\gamma$ and $\Delta t$.
Therefore, for a multidimensional quadratic potential, $\tau_{\max}$ is very likely to be determined by the lowest frequency mode. (This is readily established for the spectral gap for reasonable values of $\Delta t$.)
Having established that the lowest frequency mode is most likely responsible for $\tau_{\max}$, let $\omega$ denote the lowest frequency $\omega_1$, and consider the optimal choice of $\gamma$. From Matrix (21), observe that $\omega\Delta t\,\tau_{\max}(k)$ is a function only of $\gamma/\omega$. Accordingly, we choose $\gamma/\omega$ to minimize $\max_k \omega\Delta t\,\tau_{\max}(k)$. However, $\Delta t$ may depend on $\gamma$, but only if $\gamma > 2\omega_\nu$, according to Equation (19). Let us initially ignore this possibility and revisit it after determining the optimal $\gamma$.
Assuming that the maximum of $\omega\Delta t\,\tau_{\max}(k)$ is attained for $k = 1$ or $k = 2$, there are two possible locations for the best $\gamma$. One is the choice $\gamma = \sqrt{2}\,\omega$, for which $\omega\Delta t\,\tau_{\max}(2) = 1 + \sqrt{2}$ and $\omega\Delta t\,\tau_{\max}(1) = 2\sqrt{2}$. The other is the choice $\gamma = (\sqrt{6}/2)\,\omega$, where $\omega\Delta t\,\tau_{\max}(k) = \sqrt{6}$ for both $k = 1$ and $k = 2$. The latter is the better choice. This, and the assumption that the maximum occurs for $k < 3$, is supported by Figure 2, which plots $\tau_{\max}$ as a function of $\gamma$ for fixed values of $\omega$ and $\Delta t$. Since $(\sqrt{6}/2)\,\omega_1 < 2\omega_\nu$, we see from Equation (19) that there is no concern about $\Delta t$ limiting the value of $\gamma$.
Note that for the optimal γ , the dynamics is underdamped in every mode.
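The reasoning of the last two paragraphs can be confirmed with a small scan (illustrative code based on the asymptotic formulas for $\tau_{\max}(1)$ and $\tau_{\max}(2)$ derived above):

```python
import numpy as np

# omega*dt*tau_max(1) and omega*dt*tau_max(2) as functions of x = gamma/omega,
# from tau_max(1) = 2*gamma/(omega^2*dt) and the formula for tau_max(2) above
def f1(x):
    return 2.0 * x

def f2(x):
    return 0.5 * (x + 2.0 / x + np.sqrt(x**2 + 4.0 / x**2))

x = np.linspace(0.2, 3.0, 100_001)
worst = np.maximum(f1(x), f2(x))           # max over k = 1, 2
i = np.argmin(worst)
print("optimal gamma/omega:", x[i], " (sqrt(6)/2 =", np.sqrt(6) / 2, ")")
print("min of max         :", worst[i], " (sqrt(6)  =", np.sqrt(6), ")")
```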

6. Discussion and Conclusions

It is asserted in the literature [1,4] that, for comparing MCMC propagators and for ensuring reasonably reliable simulation results, it is of great value to estimate the maximum autocorrelation time over all possible observables. Certainly, this would seem desirable for comparing propagators, since it is best to assume as little as possible about the actual observables that are to be estimated from the Markov chain. In addition, the longest IAcT would be more reliably estimated than any other IAcT, because its accurate estimation would call for the greatest number of samples; moreover, the accuracy of estimates of other IAcTs might be influenced by the longest IAcT. Unfortunately, estimating IAcTs and assessing such estimates is a difficult task, which can benefit greatly from further research.
In this article, it is suggested that the maximum IAcT be estimated by considering an arbitrary linear combination of basis functions chosen to capture the lowest frequency motions of the dynamics of the MCMC chain. It is also worth considering the approach based on a set of well chosen indicator functions [11].
Another quantity for measuring the performance of an MCMC propagator is the spectral gap. It is related to equilibration time, but, as a predictor of IAcTs, it can be arbitrarily pessimistic, or it can be too optimistic. Such is the case for irreversible propagators, especially for energy landscapes where entropy barriers dominate energy barriers.
Shorter integrated autocorrelation times are thus a primary consideration in the design of MCMC propagators. Simple examples are constructed to explain how irreversibility might help, though their relevance to practical propagators is unclear. Nonetheless, it is advantageous not to insist on reversibility. In particular, the ballistic component of propagators like hybrid Monte Carlo and Langevin dynamics may offer dramatic speedups for overcoming entropy barriers, though no advantage for energy barriers.
Irreversible propagators typically have a parameter that determines the extent of diffusive behavior. For discrete Langevin dynamics applied to a multidimensional quadratic potential, the optimal value corresponds to slightly underdamped dynamics for the lowest frequency mode, with other modes experiencing even lighter damping. However, this conclusion is not necessarily indicative of the optimal damping for the usual situation, in which there is a multiplicity of energy minima, and further studies are needed.
In a nutshell, general-purpose sampling benefits greatly from the use of irreversible propagators in extended state space, with “diffusion parameter(s)” chosen to minimize the maximum integrated autocorrelation time, which for a quadratic potential energy corresponds to moderately underdamped dynamics in the lowest frequency mode.

Author Contributions

Youhan Fang performed the analysis for Section 5.2 and the computation for Section 5.3. Robert D. Skeel did the analysis for Section 3.1, Section 4, Section 5.1 and Section 5.3 and wrote the paper. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Fang, Y.; Cao, Y.; Skeel, R. Quasi-Reliable Estimates of Effective Sample Size. arXiv, 2017; arXiv:1705.03831v2. [Google Scholar]
  2. Schütte, C.; Huisinga, W. Biomolecular Conformations as Metastable Sets of Markov Chains. In Proceedings of the 38th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 4–6 October 2000; pp. 1106–1115. [Google Scholar]
  3. Schütte, C.; Sarich, M. Metastability and Markov State Models in Molecular Dynamics; American Mathematical Society: Providence, RI, USA, 2013. [Google Scholar]
  4. Lyman, E.; Zuckerman, D.M. On the Structural Convergence of Biomolecular Simulations by Determination of the Effective Sample Size. J. Phys. Chem. B 2007, 111, 12876–12882. [Google Scholar] [CrossRef] [PubMed]
  5. Rey-Bellet, L.; Spiliopoulos, K. Irreversible Langevin Samplers and Variance Reduction: A Large Deviations Approach. Nonlinearity 2015, 28, 2081–2104. [Google Scholar] [CrossRef]
  6. Leimkuhler, B.; Matthews, C. Rational Construction of Stochastic Numerical Methods for Molecular Sampling. Appl. Math. Res. Express 2013, 2013, 34–56. [Google Scholar] [CrossRef]
  7. Leimkuhler, B.; Matthews, C.; Stoltz, G. The Computation of Averages from Equilibrium and Nonequilibrium Langevin Molecular Dynamics. IMA J. Numer. Anal. 2016, 36, 13–79. [Google Scholar] [CrossRef]
  8. Geyer, C.J. Practical Markov Chain Monte Carlo. Stat. Sci. 1992, 7, 473–511. [Google Scholar] [CrossRef]
  9. Priestley, M.B. Spectral Analysis and Time Series; Academic Press: Cambridge, MA, USA, 1981. [Google Scholar]
  10. Goodman, J. Acor, Statistical Analysis of a Time Series. 2009. Available online: http://www.math.nyu.edu/faculty/goodman/software/acor/ (accessed on 4 August 2014).
  11. Zhang, X.; Bhatt, D.; Zuckerman, D.M. Automated Sampling Assessment for Molecular Simulations Using the Effective Sample Size. J. Chem. Theory Comput. 2010, 6, 3048–3057. [Google Scholar] [CrossRef] [PubMed]
  12. Cancés, E.; Legoll, F.; Stoltz, G. Special Issue on Molecular Modelling: Theoretical and Numerical Comparison of Some Sampling Methods for Molecular Dynamics. ESAIM Math. Model. Numer. Anal. 2007, 41, 351–389. [Google Scholar] [CrossRef]
  13. Diaconis, P.; Holmes, S.; Neal, R.M. Analysis of a Nonreversible Markov Chain Sampler. Ann. Appl. Probab. 2000, 10, 726–752. [Google Scholar] [CrossRef]
  14. Risken, H. The Fokker-Planck Equation. Methods of Solution and Applications; Springer Series in Synergetics; Springer: New York, NY, USA, 1984; Volume 18. [Google Scholar]
  15. Kozlov, S. Effective diffusion for the Fokker-Planck equation. Math. Notes 1989, 45, 360–368. [Google Scholar] [CrossRef]
Figure 1. $\gamma\Delta t\,\tau_{\max}(k)$ vs. $\omega/\gamma$ for $k = 1, 2, 3, 4$.
Figure 2. $\omega\Delta t\,\tau_{\max}(k)$ vs. $\gamma/\omega$ for $k = 1, 2, 3, 4$.
Table 1. $\tau_{\max}$ is the IAcT for $F$, and $\tau_{\max}^{\mathrm{rv}}$ is that for $F^{\mathrm{rv}} = \frac{1}{2}(F + F^T)$.

(a) $\tau_{\max}$

| $n$ \ $\varepsilon$ | 0.1  | 0.01   | 0.001   |
|------|------|--------|---------|
| 10   | 90   | 990    | 9990    |
| 100  | 900  | 9900   | 99,900  |
| 1000 | 9000 | 99,000 | 999,000 |

(b) $\tau_{\max}^{\mathrm{rv}} - \tau_{\max}$

| $n$ \ $\varepsilon$ | 0.1     | 0.01    | 0.001   |
|------|---------|---------|---------|
| 10   | 35      | 33      | 33      |
| 100  | 3910    | 3511    | 3355    |
| 1000 | 403,612 | 389,921 | 351,047 |
