
Statistical Model Parameter Estimation Using Convex Optimization

Hui Jiang and Xinwei Li
An advanced method of discriminative training for speech and language processing
IEEE Signal Processing Magazine, May 2010. Digital Object Identifier 10.1109/MSP.2010.936018

In machine learning, there are two distinct groups of learning methods for building pattern classifiers: generative learning and discriminative learning. The generative learning scheme aims to estimate the joint probability distribution of the observation data and the class label and then use the estimated distribution for classification according to the well-known Bayes decision rule. In generative learning, it is normal practice to adopt the so-called parametric modeling approach, where it is assumed that the unknown probability distributions belong to computationally tractable function families. In this way, the difficult density estimation problem becomes a more tractable parameter estimation problem. The advantage of generative learning is that inherent dependency structures and/or independence assumptions applicable to the underlying data can be explicitly exploited by choosing an appropriate statistical model, such as a mixture model or even a highly structured graphical model. The discriminative learning scheme, on the other hand, makes no explicit attempt to model the data distribution and instead optimizes a mapping function from inputs to the desired outputs. As a result, only the decision boundary is adjusted, without forming a data generator over the entire feature space. The advantage of discriminative learning lies in the fact that the mapping function can be estimated based on criteria that are more relevant to the ultimate classification or regression task. Because of their complementary nature, recent research in machine learning [1], [9], [10], [12], [32] has shown the potential benefit of combining generative and discriminative learning methods.

One active research topic in speech and language processing is how to learn generative models using discriminative learning approaches. For example, discriminative training (DT) of hidden Markov models (HMMs) for automatic speech recognition (ASR) has been intensively studied for several decades. The essential idea of DT is to apply a discriminative criterion to the training procedure of HMMs. Many discriminative criteria have been proposed for ASR, such as maximum mutual information estimation (MMIE) [2], [35], minimum classification error (MCE) [18], minimum phone (word) error (MPE/MWE) [29], and large margin estimation (LME) [13], [20], [31], [40]. In a standard DT process, an objective function is first constructed according to one of these discriminative criteria; optimization methods are then used to maximize or minimize the objective function with respect to the model parameters. The major difficulty of DT in ASR lies in the fact that the above-mentioned DT criteria normally result in large-scale nonconvex optimization problems. For example, in a state-of-the-art large-vocabulary ASR system, these DT criteria typically lead to complex nonconvex objective functions involving millions or even tens of millions of free variables.
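Where the paragraph above describes building a DT objective from a criterion such as MMIE, the following minimal sketch shows how an MMIE-style objective is typically assembled. It is our own illustration, not the authors' implementation: the per-utterance joint log-scores ln p(X_t, s | Λ) are assumed to be precomputed and stored in a hypothetical array.

```python
import numpy as np

def mmie_objective(joint_logp, true_labels):
    """MMIE-style objective: sum over utterances of
    ln p(X_t, S_t | Lambda) - ln sum_s p(X_t, s | Lambda).

    joint_logp: (T, C) array, joint_logp[t, s] = ln p(X_t, s | Lambda)
                (hypothetical precomputed scores, e.g., from HMM forward passes).
    true_labels: length-T integer array of correct label indices S_t.
    """
    T = joint_logp.shape[0]
    numer = joint_logp[np.arange(T), true_labels]      # ln p(X_t, S_t | Lambda)
    # stable log-sum-exp over all competing hypotheses
    m = joint_logp.max(axis=1, keepdims=True)
    denom = m.squeeze(1) + np.log(np.exp(joint_logp - m).sum(axis=1))
    return float((numer - denom).sum())
```

Gradient-based or EBW-style updates would then maximize such a quantity with respect to the model parameters hidden inside the scores.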
Generally speaking, optimizing nonconvex objective functions over such a large number of variables is extremely difficult, since it is very easy to get trapped in a shallow local optimum on the complicated surface of the objective function. During the last two decades, a significant amount of research effort in the speech community has been devoted to this problem, and many different optimization methods have been proposed and applied to DT of HMMs in ASR. For example, the extended Baum-Welch (EBW) method was proposed based on a growth function [7]; it was first applied to MMIE [35] and later extended to other discriminative criteria, including MCE and MPE/MWE [8]. Moreover, a stochastic approximation method called generalized probabilistic descent (GPD) [18] was first proposed for MCE, and an approximate second-order quasi-Newton method called quickprop was subsequently applied to MCE and other criteria [26]. More recently, constrained line search [21] and trust region [22], [37] methods have also been proposed for DT of HMMs in ASR. All of these optimization methods are nonconvex in nature and normally lead to better recognition performance only when carefully and skillfully implemented, as in [35]. These nonconvex optimization methods, especially the EBW approach, remain popular for DT of HMMs in ASR; interested readers may refer to the recent survey articles [8], [17] for details. In this tutorial article, as part of the broader theme of convex optimization in signal processing, we focus on more recent research that applies convex optimization methods to DT for speech and language processing applications.

INTRODUCTION

To the best of our knowledge, there have been at least two independent research efforts to apply convex optimization methods to the discriminative learning of HMMs for ASR. The advantage of convex optimization is that it does not suffer from the local optimum problem, because any local optimum is also globally optimal in a convex optimization problem. Both research groups were motivated by advances in large-margin classifiers in machine learning and attempt to estimate Gaussian mixture HMMs for ASR based on a large margin criterion. In [30] and [31], a large-margin-based DT method was proposed to estimate Gaussian mixture models for multiclass pattern classification tasks and was then extended to Gaussian mixture HMMs for sequential classification tasks such as ASR. In that work, LME of Gaussians is formulated as a regularized optimization problem whose regularization term is computed from traces of parameter matrices, and a so-called reparameterization method is proposed to relax the multivariate Gaussian parameters so that LME becomes a convex optimization problem, which is solved by a customized gradient descent method for efficiency. In [13] and [15], a different strategy is applied to LME of Gaussian mixture HMMs for ASR: LME of HMMs is directly formulated as a minimax optimization problem based on the principle of maximizing the minimum margin of the HMM-based classifiers. An iterative optimization method, called approximation-maximization (AM), is then used to solve the original minimax optimization problem through a locality approximation, and convex relaxation methods may be used to convert the original nonconvex optimization into standard convex optimization problems, such as semidefinite programming (SDP) [19], [39] and second-order cone programming (SOCP) [36], [38].
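To make the "maximize the minimum margin" formulation concrete, the toy sketch below solves a max-min margin problem in epigraph form with an off-the-shelf convex solver. It is our own simplified illustration, not the relaxation of [13], [15]: it assumes the per-sample margins have already been linearized in a parameter vector w (roughly what the locality approximation plus convex relaxation achieve), and the norm bound stands in for a locality constraint.

```python
import cvxpy as cp
import numpy as np

# Hypothetical linearized margins: d_i(w) = a_i^T w + b_i for each training sample i.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))         # 50 samples, 10 relaxed model parameters
b = rng.normal(size=50)

w = cp.Variable(10)                   # relaxed model parameters
t = cp.Variable()                     # epigraph variable: the minimum margin

constraints = [A @ w + b >= t,        # every sample's margin is at least t
               cp.norm(w, 2) <= 1.0]  # locality / boundedness constraint
prob = cp.Problem(cp.Maximize(t), constraints)
prob.solve()
print("largest achievable minimum margin:", t.value)
```

In the actual HMM case the margins are nonlinear in the model parameters, which is exactly why the locality approximation and the SDP/SOCP relaxations discussed above are needed.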
Even though this work was initially proposed only for LME of the mean parameters of Gaussian mixture HMMs, it can be easily extended to estimate other HMM parameters, such as covariance matrices and transition probabilities, as well as to other discriminative criteria, such as MMIE and MCE. As shown in [16], the proposed AM method is general enough for DT of a wide class of generative models, namely finite mixtures of exponential family models, whereas the method in [30], [31] is specially tailored to Gaussians. Over the past few years, we have successfully applied a variety of convex optimization methods under the AM framework to solve DT problems arising in speech and language processing tasks, initially starting with Gaussian mixture HMMs for speech recognition [19], [36], [38], [39] and more recently extending to multinomial-based models for language processing [25], [28]. Here, convex optimization plays a critical role in solving the underlying large-scale optimization problems in DT of the generative models. In this tutorial article, as an example of convex optimization for signal processing, we summarize recent research efforts that apply convex optimization to DT of various statistical models under the AM framework and highlight the emerging role of convex optimization in these traditional speech and language applications.

GENERATIVE VERSUS DISCRIMINATIVE LEARNING

Pattern classification is the task of classifying an unknown object into one of a set of a priori pattern classes based on some noisy observation data (also known as features). The generative learning scheme originates from the Bayes decision rule (also known as the maximum a posteriori decision rule), which is guaranteed to achieve the minimum classification error rate provided the underlying data distributions are known. For an unknown object, we assume its class label, S, and its observation data, X, are both random variables. If the true joint probability distribution of X and S is known as p(X, S) = p(S) · p(X|S), the optimal pattern classifier can be built based on the maximum a posteriori (MAP) decision rule. In other words, the class label of an unknown observation X is predicted as follows:

  $\hat{S} = \arg\max_S \, p(S \mid X) = \arg\max_S \, p(S) \cdot p(X \mid S)$.   (1)

As indicated in (1), constructing the optimal pattern classifier requires two probability distributions, namely the prior probability p(S) and the conditional probability p(X|S), which are usually unknown in practice. The essential problem in generative learning is how to estimate them from some available training data. To simplify the estimation problem, we adopt the so-called parametric modeling approach, where it is assumed that the unknown probability distributions belong to a computationally tractable function family, such as the exponential family [4]. Furthermore, latent variables may be introduced to derive even more complex models; in particular, a rich family of multimodal distributions can be obtained by introducing discrete latent variables. In this way, the difficult density estimation problem turns into a more tractable parameter estimation problem.
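As a minimal code illustration of the MAP rule in (1), classification reduces to picking the class with the largest joint log-score. This is our own sketch; the class priors and class-conditional log-likelihoods are assumed to be supplied by already-estimated models.

```python
import numpy as np

def map_classify(log_prior, loglik_fn, x):
    """MAP decision rule of eq. (1): argmax_S [ln p(S) + ln p(x|S)].

    log_prior: (C,) array of ln p(S) for each of C classes.
    loglik_fn: callable returning a (C,) array of ln p(x|S) for observation x
               (hypothetical, e.g., per-class Gaussian mixture or HMM scores).
    """
    joint = log_prior + loglik_fn(x)   # ln p(S) + ln p(x|S); ln p(x) is a common constant
    return int(np.argmax(joint))
```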
One major benefit of generative learning is that there exist efficient learning algorithms that can estimate rather complicated models based on the maximum likelihood (ML) criterion, e.g., the expectation-maximization (EM) algorithm [6].

Discriminative learning uses a mapping function (from observation X to class label S) to model the class boundaries, without forming a data distribution over the entire feature space. The mapping function is estimated using criteria that are more relevant to the ultimate classification or regression task, such as maximum conditional likelihood (MCL) estimation [11], empirical risk minimization (ERM), or LME [41]. Traditionally, only relatively simple models have been considered in the discriminative learning scheme because of computational complexity issues, e.g., linear discriminant functions for support vector machines (SVMs); other discriminative models include logistic regression and neural networks. It is an interesting topic to extend the discriminative learning scheme to other, more complicated models, particularly the mixture models and latent graphical models widely adopted in the generative learning scheme. Recent work in both machine learning [1], [9], [10], [12], [32] and speech recognition [2], [18], [35] has shown the benefit of learning generative models discriminatively. Discriminative learning of these generative models poses a computational challenge, since it normally leads to a complicated nonconvex optimization problem, and a significant amount of research effort has been devoted to it. In this article, we introduce one method proposed for discriminative learning of a wide class of generative models, i.e., mixtures of the exponential family (e-family), and particularly underline the role of convex optimization in this method through convex relaxation techniques.

In the next section, we first introduce notation for a general canonical form of the e-family and its mixtures, since many popular generative models fall into this category. Following that, we present a general framework to learn mixtures of exponential family models in a discriminative way.

STATISTICAL MODELS: THE E-FAMILY AND ITS MIXTURES

In practice, we normally choose generative models from a rather general family of statistical models, namely the e-family, because of its high computational tractability in parameter estimation.

THE E-FAMILY

As in [4], all statistical models in the e-family can be represented in the following general canonical form:

  $p(X \mid \lambda) = \exp\{A(x) + x^T \lambda - K(\lambda)\}$,

where λ denotes the natural parameter vector of the e-family distribution and x is called the sufficient statistics. In most cases, the natural parameters λ take a different form from the original model parameters, and the sufficient statistics x can be represented as a function of the original data X. Furthermore, K(λ) is called the cumulant generating function; it is a convex function of the natural parameters λ and is independent of X. A(x) is a function of the sufficient statistics x and is independent of λ. Many common probability distributions belong to the e-family, including the Gaussian, multinomial, Bernoulli, exponential, gamma, and Poisson distributions.
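To make the canonical form concrete before the Gaussian worked example that follows, here is a small numerical check that the Bernoulli distribution, one of the e-family members just listed, fits the form with A(x) = 0, sufficient statistic x, natural parameter λ = ln(p/(1-p)), and K(λ) = ln(1 + e^λ). This is our own sketch, not taken from the article.

```python
import numpy as np

def bernoulli_canonical(x, lam):
    """Bernoulli pmf in e-family canonical form:
    p(x|lam) = exp{A(x) + x*lam - K(lam)}, with A(x) = 0 and K(lam) = ln(1 + e^lam)."""
    return np.exp(x * lam - np.log1p(np.exp(lam)))

p = 0.3
lam = np.log(p / (1.0 - p))                 # natural parameter
for x in (0, 1):
    direct = p**x * (1.0 - p)**(1 - x)      # standard parameterization p^x (1-p)^(1-x)
    assert np.isclose(bernoulli_canonical(x, lam), direct)
```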
For example, the one-dimensional Gaussian (with unknown mean and variance) can be represented in the canonical e-family form as

  $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \exp\{A(x) + x^T \lambda - K(\lambda)\}$,

where the sufficient statistics are x = [x, -x²/2], the natural parameters are λ = [μ/σ², 1/σ²], A(x) = -(1/2) ln 2π, and the cumulant generating function is K(λ) = -(1/2) ln λ₂ + λ₁²/(2λ₂) (with the constraint λ₂ > 0), which is clearly a convex function of λ. All other common probability distributions can be represented in a similar way. Table 1 summarizes the corresponding results for the multivariate Gaussian and multinomial distributions, since they are the most relevant to the remainder of this article.

One important property of the e-family is that products of e-family distributions remain in the e-family; the natural parameters of the product distribution are obtained simply by concatenating the natural parameters of all member distributions.

[TABLE 1] CANONICAL FORM OF E-FAMILY DISTRIBUTIONS: A) MULTIVARIATE GAUSSIAN WITH KNOWN PRECISION MATRIX; B) MULTIVARIATE GAUSSIAN WITH UNKNOWN MEAN AND PRECISION MATRIX; C) MULTINOMIAL IN A CONSTRAINED FORM; D) MULTINOMIAL IN AN UNCONSTRAINED FORM. Each row lists the natural parameters λ, the sufficient statistics x, the cumulant generating function K(λ), and any constraint.

A) Gaussian N(μ, Σ₀), mean only:
   λ = μ;  x = Σ₀⁻¹ x;  K(λ) = (1/2) λᵀ Σ₀⁻¹ λ;  no constraint.
B) Gaussian N(μ, Σ), mean and covariance:
   λ = [Σ⁻¹ μ, Σ⁻¹];  x = [x, -(1/2) x xᵀ];  K(λ) = -(1/2) ln|λ₂| + (1/2) λ₁ᵀ λ₂⁻¹ λ₁;  constraint λ₂ ≻ 0.
C) Multinomial C ∏_{d=1}^{D} μ_d^{x_d}, constrained form:
   λ = [ln μ₁, …, ln μ_D];  x = [x₁, …, x_D];  K(λ) = 0;  constraints λ_d < 0 and Σ_{d=1}^{D} e^{λ_d} = 1.
D) Multinomial C ∏_{d=1}^{D} μ_d^{x_d}, unconstrained form:
   λ = [ln(μ₁/(1 - Σ_{d=1}^{D-1} μ_d)), …, ln(μ_{D-1}/(1 - Σ_{d=1}^{D-1} μ_d))];
   x = [x₁/Σ_{d=1}^{D} x_d, …, x_{D-1}/Σ_{d=1}^{D} x_d];
   K(λ) = ln(1 + Σ_{d=1}^{D-1} e^{λ_d});  no constraint.

MIXTURES OF THE E-FAMILY

All e-family distributions are computationally tractable in parameter estimation, but they are normally too simple to model real-world data appropriately. A widely adopted strategy is to introduce latent variables to derive more complicated models from these simple e-family distributions; in particular, a rather rich family of multimodal distributions can be obtained by introducing discrete latent variables to derive the so-called finite mixture models.

Mixtures of the e-family generalize the simple e-family distributions above by taking a convex combination of different e-family distributions with different parameters. In a very general form, a mixture of the e-family can be written as

  $p(X \mid \lambda) = \sum_k w_k \exp\{A_k(x_k) + \lambda^T x_k - K_k(\lambda)\}$,

where λ denotes the natural parameters of the mixture model, concatenated from the natural parameters of its component distributions; the sufficient statistics x_k and the cumulant generating functions K_k(λ) may vary across mixture components; and the mixture weights w_k satisfy the sum-to-one constraint Σ_k w_k = 1. Mixtures of the e-family represent a wide class of statistical models that are frequently used in machine learning and pattern classification, including the Gaussian mixture model (GMM), the multinomial mixture model (MMM), and the HMM. In the HMM, the latent variable is a complete state sequence, x_k represents the sufficient statistics collected along one state sequence, and each mixture component is a product of various e-family distributions.

In the generative learning scheme, statistical models are normally estimated from the available training data based on the maximum likelihood (ML) criterion. Owing to the convexity of the cumulant generating function K(λ), ML estimation of an e-family model normally results in a simple solution; in many cases it even admits a closed-form solution. Furthermore, ML estimation of mixtures of the e-family can be derived iteratively in a relatively easy manner, relying on the well-known EM algorithm [6].
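As a small illustration of the mixture form, the sketch below evaluates the log-likelihood of a one-dimensional GMM directly from natural parameters, reusing the 1-D Gaussian canonical form derived above. It is our own example and not tied to any particular ASR model.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of a 1-D Gaussian mixture, each component evaluated in
    e-family canonical form exp{A(x) + x^T lam - K(lam)}, where
    x = [x, -x^2/2], lam = [mu/sigma^2, 1/sigma^2],
    A(x) = -(1/2) ln 2*pi, K(lam) = -(1/2) ln lam2 + lam1^2 / (2*lam2)."""
    lam1 = means / variances
    lam2 = 1.0 / variances
    A = -0.5 * np.log(2.0 * np.pi)
    K = -0.5 * np.log(lam2) + lam1**2 / (2.0 * lam2)
    log_comp = A + lam1 * x + lam2 * (-0.5 * x**2) - K      # per-component log-density
    return float(np.log(np.sum(weights * np.exp(log_comp))))

# quick check against the standard Gaussian mixture formula
w = np.array([0.4, 0.6]); mu = np.array([-1.0, 2.0]); var = np.array([0.5, 1.5])
x = 0.7
direct = np.sum(w * np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var))
assert np.isclose(gmm_loglik(x, w, mu, var), np.log(direct))
```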
A NEW FRAMEWORK FOR DT OF GENERATIVE MODELS

In this section, we consider the use of alternative, discriminative learning approaches to estimate mixtures of e-family models. Unlike conventional ML estimation, discriminative learning of these generative models faces significant computational challenges, no longer enjoying the simple updating rules and relatively fast convergence of the EM algorithm. Generally speaking, discriminative learning of generative models is an optimization problem in which an objective function must be optimized in an iterative manner. The discriminative objective function is normally formulated according to one of the popular discriminative criteria, such as maximum mutual information (MMI) [2], [35] (equivalently, maximum conditional likelihood [11]), MCE [18], or LME [13], [31]. Historically, these discriminative criteria were proposed in different contexts. In the following, we first present a unified view of these discriminative criteria, centering on the concept of margin. We then discuss a general method, named the AM method, that optimizes the resulting discriminative objective functions using locality approximation and convex optimization.

A UNIFIED VIEW OF VARIOUS DT CRITERIA

In supervised learning, we are given a set of training samples, denoted D = {X₁, X₂, …, X_T}, together with the true class labels of all training samples in D, denoted L = {S₁, S₂, …, S_T}. For notational convenience, we use the uppercase S_t to represent the true transcription of each training sample X_t, and the lowercase s_t to denote a variable that may take any possible label in a hypothesis space. Following [5] and [34], a multiclass separation margin for each training sample X_t is defined as

  $d(X_t \mid \Lambda) = \ln p(S_t, X_t \mid \Lambda) - \max_{s_t \in \mathcal{M}_t} \ln p(s_t, X_t \mid \Lambda)$,   (2)

where Λ denotes the whole set of model parameters from all classes and the maximum is taken over a particular hypothesis space, denoted M_t.
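For a concrete reading of (2), the margin of each training sample is simply the gap between the log-score of its correct label and that of its best competing hypothesis. The sketch below is our own; it assumes the joint log-likelihoods have already been computed and, for illustration, takes the competing set M_t to be all labels other than S_t.

```python
import numpy as np

def separation_margins(joint_logp, true_labels):
    """Multiclass separation margin of eq. (2):
    d(X_t | Lambda) = ln p(S_t, X_t | Lambda) - max_{s in M_t} ln p(s, X_t | Lambda).

    joint_logp: (T, C) array with joint_logp[t, s] = ln p(s, X_t | Lambda).
    true_labels: length-T integer array of correct label indices S_t.
    """
    T = joint_logp.shape[0]
    correct = joint_logp[np.arange(T), true_labels]
    competitors = joint_logp.copy()
    competitors[np.arange(T), true_labels] = -np.inf   # exclude the true label from M_t
    best_rival = competitors.max(axis=1)
    return correct - best_rival                        # positive margin = correctly classified
```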