
An Introduction to MCMC for Machine Learning
Machine Learning, 50, 5–43, 2003
© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

CHRISTOPHE ANDRIEU (C.Andrieu@bristol.ac.uk), Department of Mathematics, Statistics Group, University of Bristol, University Walk, Bristol BS8 1TW, UK
NANDO DE FREITAS (nando@cs.ubc.ca), Department of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, BC V6T 1Z4, Canada
ARNAUD DOUCET (doucet@ee.mu.oz.au), Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, Victoria 3052, Australia
MICHAEL I. JORDAN (jordan@cs.berkeley.edu), Departments of Computer Science and Statistics, University of California at Berkeley, 387 Soda Hall, Berkeley, CA 94720-1776, USA

Abstract. The purpose of this introductory paper is threefold. First, it introduces the Monte Carlo method with emphasis on probabilistic machine learning. Second, it reviews the main building blocks of modern Markov chain Monte Carlo simulation, thereby providing an introduction to the remaining papers of this special issue. Lastly, it discusses new interesting research horizons.

Keywords: Markov chain Monte Carlo, MCMC, sampling, stochastic algorithms

1. Introduction

A recent survey places the Metropolis algorithm among the ten algorithms that have had the greatest influence on the development and practice of science and engineering in the 20th century (Beichl & Sullivan, 2000). This algorithm is an instance of a large class of sampling algorithms, known as Markov chain Monte Carlo (MCMC). These algorithms have played a significant role in statistics, econometrics, physics and computing science over the last two decades. There are several high-dimensional problems, such as computing the volume of a convex body in d dimensions, for which MCMC simulation is the only known general approach for providing a solution within a reasonable time (polynomial in d) (Dyer, Frieze, & Kannan, 1991; Jerrum & Sinclair, 1996).

While convalescing from an illness in 1946, Stan Ulam was playing solitaire. It then occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards would come out successfully (Eckhardt, 1987). After attempting exhaustive combinatorial calculations, he decided to go for the more practical approach of laying out several solitaires at random and then observing and counting the number of successful plays. This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation.

Stan Ulam soon realised that computers could be used in this fashion to answer questions of neutron diffusion and mathematical physics. He contacted John Von Neumann, who understood the great potential of this idea. Over the next few years, Ulam and Von Neumann developed many Monte Carlo algorithms, including importance sampling and rejection sampling. Enrico Fermi in the 1930's also used Monte Carlo in the calculation of neutron diffusion, and later designed the FERMIAC, a Monte Carlo mechanical device that performed calculations (Anderson, 1986). In the 1940's Nick Metropolis, a young physicist, designed new controls for the state-of-the-art computer (ENIAC) with Klari Von Neumann, John's wife. He was fascinated with Monte Carlo methods and this new computing device. Soon he designed an improved computer, which he named the MANIAC in the hope that computer scientists would stop using acronyms.
During the time he spent working on the computing machines, many mathematicians and physicists (Fermi, Von Neumann, Ulam, Teller, Richtmyer, Bethe, Feynman, & Gamow) would go to him with their work problems. Eventually in 1949, he published the first public document on Monte Carlo simulation with Stan Ulam (Metropolis & Ulam, 1949). This paper introduces, among other ideas, Monte Carlo particle methods, which form the basis of modern sequential Monte Carlo methods such as bootstrap filters, condensation, and survival of the fittest algorithms (Doucet, de Freitas, & Gordon, 2001). Soon after, he proposed the Metropolis algorithm with the Tellers and the Rosenbluths (Metropolis et al., 1953).

Many papers on Monte Carlo simulation appeared in the physics literature after 1953. From an inference perspective, the most significant contribution was the generalisation of the Metropolis algorithm by Hastings in 1970. Hastings and his student Peskun showed that Metropolis and the more general Metropolis-Hastings algorithms are particular instances of a large family of algorithms, which also includes the Boltzmann algorithm (Hastings, 1970; Peskun, 1973). They studied the optimality of these algorithms and introduced the formulation of the Metropolis-Hastings algorithm that we adopt in this paper. In the 1980's, two important MCMC papers appeared in the fields of computer vision and artificial intelligence (Geman & Geman, 1984; Pearl, 1987). Despite the existence of a few MCMC publications in the statistics literature at this time, it is generally accepted that it was only in 1990 that MCMC made the first significant impact in statistics (Gelfand & Smith, 1990). In the neural networks literature, the publication of Neal (1996) was particularly influential.

In the introduction to this special issue, we focus on describing algorithms that we feel are the main building blocks in modern MCMC programs. We should emphasize that in order to obtain the best results out of this class of algorithms, it is important that we do not treat them as black boxes, but instead try to incorporate as much domain specific knowledge as possible into their design. MCMC algorithms typically require the design of proposal mechanisms to generate candidate hypotheses. Many existing machine learning algorithms can be adapted to become proposal mechanisms (de Freitas et al., 2001). This is often essential to obtain MCMC algorithms that converge quickly. In addition to this, we believe that the machine learning community can contribute significantly to the solution of many open problems in the MCMC field. For this purpose, we have outlined several "hot" research directions at the end of this paper. Finally, readers are encouraged to consult the excellent texts of Chen, Shao, and Ibrahim (2001), Gilks, Richardson, and Spiegelhalter (1996), Liu (2001), Meyn and Tweedie (1993), Robert and Casella (1999) and review papers by Besag et al. (1995), Brooks (1998), Diaconis and Saloff-Coste (1998), Jerrum and Sinclair (1996), Neal (1993), and Tierney (1994) for more information on MCMC.

The remainder of this paper is organised as follows. In Part 2, we outline the general problems and introduce simple Monte Carlo simulation, rejection sampling and importance sampling. Part 3 deals with the introduction of MCMC and the presentation of the most popular MCMC algorithms. In Part 4, we describe some important research frontiers.
To make the paper more accessible, we make no notational distinction between distributions and densities until the section on reversible jump MCMC.

2. MCMC motivation

MCMC techniques are often applied to solve integration and optimisation problems in large dimensional spaces. These two types of problem play a fundamental role in machine learning, physics, statistics, econometrics and decision analysis. The following are just some examples.

1. Bayesian inference and learning. Given some unknown variables x ∈ X and data y ∈ Y, the following typically intractable integration problems are central to Bayesian statistics.

(a) Normalisation. To obtain the posterior p(x | y) given the prior p(x) and likelihood p(y | x), the normalising factor in Bayes' theorem needs to be computed:

p(x \mid y) = \frac{p(y \mid x)\, p(x)}{\int_{\mathcal{X}} p(y \mid x')\, p(x')\, dx'}.

(b) Marginalisation. Given the joint posterior of (x, z) ∈ X × Z, we may often be interested in the marginal posterior

p(x \mid y) = \int_{\mathcal{Z}} p(x, z \mid y)\, dz.

(c) Expectation. The objective of the analysis is often to obtain summary statistics of the form

E_{p(x \mid y)}(f(x)) = \int_{\mathcal{X}} f(x)\, p(x \mid y)\, dx

for some function of interest f : X → R^{n_f} integrable with respect to p(x | y). Examples of appropriate functions include the conditional mean, in which case f(x) = x, or the conditional covariance of x, where f(x) = x x' − E_{p(x|y)}(x) E'_{p(x|y)}(x).

2. Statistical mechanics. Here, one needs to compute the partition function Z of a system with states s and Hamiltonian E(s):

Z = \sum_{s} \exp\left( -\frac{E(s)}{kT} \right),

where k is Boltzmann's constant and T denotes the temperature of the system. Summing over the large number of possible configurations is prohibitively expensive (Baxter, 1982). Note that the problems of computing the partition function and the normalising constant in statistical inference are analogous.

3. Optimisation. The goal of optimisation is to extract the solution that minimises some objective function from a large set of feasible solutions. In fact, this set can be continuous and unbounded. In general, it is too computationally expensive to compare all the solutions to find out which one is optimal.

4. Penalised likelihood model selection. This task typically involves two steps. First, one finds the maximum likelihood (ML) estimates for each model separately. Then one uses a penalisation term (for example MDL, BIC or AIC) to select one of the models. The problem with this approach is that the initial set of models can be very large. Moreover, many of those models are of no interest and, therefore, computing resources are wasted.

Although we have emphasized integration and optimisation, MCMC also plays a fundamental role in the simulation of physical systems. This is of great relevance in nuclear physics and computer graphics (Chenney & Forsyth, 2000; Kalos & Whitlock, 1986; Veach & Guibas, 1997).

2.1. The Monte Carlo principle

The idea of Monte Carlo simulation is to draw an i.i.d. set of samples {x^{(i)}}_{i=1}^{N} from a target density p(x) defined on a high-dimensional space X (e.g. the set of possible configurations of a system, the space on which the posterior is defined, or the combinatorial set of feasible solutions). These N samples can be used to approximate the target density with the following empirical point-mass function:

p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x),

where \delta_{x^{(i)}}(x) denotes the delta-Dirac mass located at x^{(i)}.
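As a concrete illustration of the Monte Carlo principle, the following minimal Python sketch draws i.i.d. samples from a toy one-dimensional target and integrates a function of interest against the resulting empirical point-mass approximation. The Gaussian target, the choice f(x) = x^2 and the sample size are assumptions made purely for this illustration; the sample average used here is exactly the estimator formalised in the next paragraph.

```python
import numpy as np

# Toy setup (illustrative assumptions): target p(x) = N(0, 1) and f(x) = x^2,
# so the exact value of I(f) = E_p[f(x)] is 1.
rng = np.random.default_rng(0)
N = 10_000

x = rng.normal(loc=0.0, scale=1.0, size=N)   # i.i.d. samples x^(i) ~ p(x)

# Integrating f against the empirical point-mass approximation p_N(x)
# reduces to averaging f over the samples.
I_N = np.mean(x**2)
print(f"I_N(f) = {I_N:.4f}  (exact value: 1.0)")
```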
Consequently, one can approximate the integrals (or very large sums) I(f) with tractable sums I_N(f) that converge as follows:

I_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) \xrightarrow[N \to \infty]{\text{a.s.}} I(f) = \int_{\mathcal{X}} f(x)\, p(x)\, dx.

That is, the estimate I_N(f) is unbiased and, by the strong law of large numbers, it will almost surely (a.s.) converge to I(f). If the variance (in the univariate case for simplicity) of f(x) satisfies \sigma_f^2 \triangleq E_{p(x)}(f^2(x)) - I^2(f) < \infty, then the variance of the estimator I_N(f) is equal to var(I_N(f)) = \sigma_f^2 / N, and a central limit theorem yields convergence in distribution of the error

\sqrt{N}\, (I_N(f) - I(f)) \underset{N \to \infty}{\Longrightarrow} \mathcal{N}(0, \sigma_f^2),

where \Longrightarrow denotes convergence in distribution (Robert & Casella, 1999; Section 3.2). The advantage of Monte Carlo integration over deterministic integration arises from the fact that the former positions the integration grid (samples) in regions of high probability.

The N samples can also be used to obtain a maximum of the objective function p(x) as follows:

\hat{x} = \arg\max_{x^{(i)};\, i=1,\ldots,N} p(x^{(i)}).

However, we will show later that it is possible to construct simulated annealing algorithms that allow us to sample approximately from a distribution whose support is the set of global maxima.

When p(x) has standard form, e.g. Gaussian, it is straightforward to sample from it using easily available routines. However, when this is not the case, we need to introduce more sophisticated techniques based on rejection sampling, importance sampling and MCMC.

2.2. Rejection sampling

We can sample from a distribution p(x), which is known up to a proportionality constant, by sampling from another easy-to-sample proposal distribution q(x) that satisfies p(x) ≤ M q(x), M < ∞, using the accept/reject procedure described in figure 1 (see also figure 2). The accepted x^{(i)} can be easily shown to be sampled with probability p(x) (Robert & Casella, 1999, p. 49).

Figure 1. Rejection sampling algorithm. Here, u ∼ U(0,1) denotes the operation of sampling a uniform random variable on the interval (0, 1).

Figure 2. Rejection sampling: sample a candidate x^{(i)} and a uniform variable u. Accept the candidate sample if u M q(x^{(i)}) < p(x^{(i)}), otherwise reject it.

This simple method suffers from severe limitations. It is not always possible to bound p(x)/q(x) with a reasonable constant M over the whole space X. If M is too large, the acceptance probability

\Pr(x \text{ accepted}) = \Pr\left( u < \frac{p(x)}{M q(x)} \right) = \frac{1}{M}

will be too small. This makes the method impractical in high-dimensional scenarios.

2.3. Importance sampling

Importance sampling is an alternative "classical" solution that goes back to the 1940's; see for example (Geweke, 1989; Rubinstein, 1981). Let us introduce, again, an arbitrary importance proposal distribution q(x) such that its support includes the support of p(x). Then we can rewrite I(f) as follows:

I(f) = \int f(x)\, w(x)\, q(x)\, dx,

where w(x) \triangleq p(x)/q(x) is known as the importance weight. Consequently, if one can simulate N i.i.d. samples {x^{(i)}}_{i=1}^{N} according to q(x) and evaluate w(x^{(i)}), a possible Monte Carlo estimate of I(f) is

\hat{I}_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})\, w(x^{(i)}).

This estimator is unbiased and, under weak assumptions, the strong law of large numbers applies, that is \hat{I}_N(f) \to I(f) almost surely as N \to \infty.
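A minimal Python sketch of this (unnormalised) importance sampling estimator is given below. The standard Gaussian target, the wider Gaussian proposal and the choice f(x) = x^2 are illustrative assumptions only; the densities are written out explicitly so that the weights w(x) = p(x)/q(x) are visible.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), used here for both target and proposal."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
N = 10_000

# Illustrative assumptions: target p = N(0, 1), proposal q = N(0, 2^2).
# The support of q covers the support of p, as required.
x = rng.normal(loc=0.0, scale=2.0, size=N)           # x^(i) ~ q(x)
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)  # importance weights w(x^(i))

I_hat = np.mean(x**2 * w)   # unbiased estimate of I(f) = E_p[x^2] = 1
print(f"I_hat_N(f) = {I_hat:.4f}  (exact value: 1.0)")
```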
It is clear that this integration method can also be interpreted as a sampling method where the posterior density p(x) is approximated by

\hat{p}_N(x) = \frac{1}{N} \sum_{i=1}^{N} w(x^{(i)})\, \delta_{x^{(i)}}(x),

and \hat{I}_N(f) is nothing but the function f(x) integrated with respect to the empirical measure \hat{p}_N(x).

Some proposal distributions q(x) will obviously be preferable to others. An important criterion for choosing an optimal proposal distribution is to find one that minimises the variance of the estimator \hat{I}_N(f). The variance of f(x) w(x) with respect to q(x) is given by

\operatorname{var}_{q(x)}(f(x) w(x)) = E_{q(x)}(f^2(x) w^2(x)) - I^2(f).   (8)

The second term on the right hand side does not depend on q(x) and hence we only need to minimise the first term, which according to Jensen's inequality has the following lower bound:

E_{q(x)}(f^2(x) w^2(x)) \geq \left( E_{q(x)}(|f(x)|\, w(x)) \right)^2 = \left( \int |f(x)|\, p(x)\, dx \right)^2.

This lower bound is attained when we adopt the following optimal importance distribution:

q^*(x) = \frac{|f(x)|\, p(x)}{\int |f(x)|\, p(x)\, dx}.

The optimal proposal is not very useful in the sense that it is not easy to sample from |f(x)| p(x). However, it tells us that high sampling efficiency is achieved when we focus on sampling from p(x) in the important regions where |f(x)| p(x) is relatively large; hence the name importance sampling.

This result implies that importance sampling estimates can be super-efficient. That is, for a given function f(x), it is possible to find a distribution q(x) that yields an estimate with a lower variance than when using a perfect Monte Carlo method, i.e. with q(x) = p(x). This property is often exploited to evaluate the probability of rare events in communication networks (Smith, Shafi, & Gao, 1997). There the quantity of interest is a tail probability (bit error rate) and hence f(x) = I_E(x), where I_E(x) = 1 if x ∈ E and 0 otherwise (see figure 3). One could estimate the bit error rate more efficiently by sampling according to q(x) ∝ I_E(x) p(x) than according to q(x) = p(x). That is, it is wasteful to propose candidates in regions of no utility.

Figure 3. Importance sampling: one should place more importance on sampling from the state space regions that matter. In this particular example one is interested in computing a tail probability of error (detecting infrequent abnormalities).

In many applications, the aim is usually different in the sense that one wants to have a good approximation of p(x) and not of a particular integral with respect to p(x), so we often seek to have q(x) ≈ p(x).

As the dimension of x increases, it becomes harder to obtain a suitable q(x) from which to draw samples. A sensible strategy is to adopt a parameterised q(x, θ) and to adapt θ during the simulation. Adaptive importance sampling appears to have originated in the structural safety literature (Bucher, 1988), and has been extensively applied in the communications literature (Al-Qaq, Devetsikiotis, & Townsend, 1995; Remondo et al., 2000). This technique has also been exploited recently in the machine learning community (de Freitas et al., 2000; Cheng & Druzdzel, 2000; Ortiz & Kaelbling, 2000; Schuurmans & Southey, 2000). A popular adaptive strategy involves computing the derivative of the first term on the right hand side of Eq. (8),

D(\theta) = E_{q(x,\theta)}\left( f^2(x)\, w(x,\theta)\, \frac{\partial w(x,\theta)}{\partial \theta} \right),

and then updating the parameters as follows:

\theta_{t+1} = \theta_t - \alpha\, \frac{1}{N} \sum_{i=1}^{N} f^2(x^{(i)})\, w(x^{(i)}, \theta_t)\, \frac{\partial w(x^{(i)}, \theta_t)}{\partial \theta_t},

where α is a learning rate and x^{(i)} ∼ q(x, θ_t).
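The gradient step above can be made concrete with a short sketch. The parameterisation below (a one-dimensional Gaussian proposal q(x, θ) = N(θ, σ_q^2) whose mean θ is adapted), together with the target, the function f, the learning rate and the number of iterations, are all assumptions chosen for illustration rather than choices made in this paper.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)

# Illustrative assumptions: target p = N(3, 1), f(x) = x,
# proposal q(x, theta) = N(theta, sigma_q^2) with only the mean adapted.
mu_p, sigma_p = 3.0, 1.0
sigma_q = 2.0
theta = 0.0      # initial proposal mean
alpha = 0.02     # learning rate
N = 2_000

for t in range(300):
    x = rng.normal(theta, sigma_q, size=N)                         # x^(i) ~ q(x, theta_t)
    w = gauss_pdf(x, mu_p, sigma_p) / gauss_pdf(x, theta, sigma_q)
    # For this Gaussian proposal, dw/dtheta = -w * dlog q/dtheta = -w * (x - theta) / sigma_q^2.
    dw_dtheta = -w * (x - theta) / sigma_q**2
    D = np.mean(x**2 * w * dw_dtheta)   # Monte Carlo estimate of D(theta), with f(x) = x
    theta -= alpha * D                  # gradient step on the first term of Eq. (8)

# The adapted mean should move towards the region where |f(x)| p(x) is large
# (around x = 3 for this toy problem).
print(f"adapted proposal mean: {theta:.3f}")
```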
Other optimisation approaches that make use of the Hessian are also possible.

When the normalising constant of p(x) is unknown, it is still possible to apply the importance sampling method by rewriting I(f) as follows:

I(f) = \frac{\int f(x)\, w(x)\, q(x)\, dx}{\int w(x)\, q(x)\, dx},

where w(x) ∝ p(x)/q(x) is now only known up to a normalising constant. The Monte Carlo estimate of I(f) becomes

\tilde{I}_N(f) = \frac{\frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})\, w(x^{(i)})}{\frac{1}{N} \sum_{j=1}^{N} w(x^{(j)})} = \sum_{i=1}^{N} f(x^{(i)})\, \tilde{w}(x^{(i)}),

where \tilde{w}(x^{(i)}) is a normalised importance weight. For N finite, \tilde{I}_N(f) is biased (ratio of two estimates), but asymptotically, under weak assumptions, the strong law of large numbers applies, that is \tilde{I}_N(f) \to I(f) almost surely as N \to \infty. Under additional assumptions a central limit theorem can be obtained (Geweke, 1989). The estimator \tilde{I}_N(f) has been shown to perform better than \hat{I}_N(f) in some setups under squared error loss (Robert & Casella, 1999).

If one is interested in obtaining M i.i.d. samples from \hat{p}_N(x), then an asymptotically (N/M → ∞) valid method consists of resampling M times according to the discrete distribution \hat{p}_N(x). This procedure results in M samples \tilde{x}^{(i)} with the possibility that \tilde{x}^{(i)} = \tilde{x}^{(j)} for i ≠ j. This method is known as sampling importance resampling (SIR) (Rubin, 1988). After resampling, the approximation of the target density is

\tilde{p}_M(x) = \frac{1}{M} \sum_{i=1}^{M} \delta_{\tilde{x}^{(i)}}(x).   (13)

The resampling scheme introduces some additional Monte Carlo variation. It is, therefore, not clear whether the SIR procedure can lead to practical gains in general. However, in the sequential Monte Carlo setting described in Section 4.3, it is essential to carry out this resampling step.

We conclude this section by stating that even with adaptation, it is often impossible to obtain proposal distributions that are easy to sample from and good approximations at the same time. For this reason, we need to introduce more sophisticated sampling algorithms based on Markov chains.

3. MCMC algorithms

MCMC is a strategy for generating samples x^{(i)} while exploring the state space X using a Markov chain mechanism. This mechanism is constructed so that the chain spends more time in the most important regions. In particular, it is constructed so that the samples x^{(i)} mimic samples drawn from the target distribution p(x). (We reiterate that we use MCMC when we cannot draw samples from p(x) directly, but can evaluate p(x) up to a normalising constant.)

It is intuitive to introduce Markov chains on finite state spaces, where x^{(i)} can only take s discrete values x^{(i)} ∈ X = {x_1, x_2, ..., x_s}. The stochastic process x^{(i)