The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition
KEINOSUKE FUKUNAGA, SENIOR MEMBER, IEEE, AND LARRY D. HOSTETLER, MEMBER, IEEE
Abstract-Nonparametric density gradient estimation using a gen-
eralized kernel approach is investigated. Conditions on the kernel func-
tions are derived to guarantee asymptotic unbiasedness, consistency, and
uniform consistency of the estimates. The results are generalized to
obtain a simple mean-shift estimate that can be extended in a k-nearest-
neighbor approach. Applications of gradient estimation to pattern
recognition are presented using clustering and intrinsic dimensionality
problems, with the ultimate goal of providing further understanding of
these problems in terms of density gradients.
I. INTRODUCTION
NONPARAMETRIC estimation of probability density functions is based on the concept that the value of a
density function at a continuity point can be estimated
using the sample observations that fall within a small
region around that point. This idea was perhaps first used
in Fix and Hodges' [1] original work. Rosenblatt [2],
Parzen [3], and Cacoullos [4] generalized these results and
developed the Parzen kernel class of density estimates.
These estimates were shown to be asymptotically unbiased,
consistent in a mean-square sense, and uniformly consistent
(in probability). Additional work by Nadaraya [5], later
extended by Van Ryzin [6] to n dimensions, showed strong
and uniformly strong consistency (with probability one).
Similarly, the gradient of a probability density function
can be estimated using the sample observations within a
small region. In this paper we will use these concepts and
results to derive a general kernel class of density gradient
estimates. Although Bhattacharyya [7] and Schuster [8]
considered estimating the density and all its derivatives in
the univariate case, we are interested in the multivariate
gradient, and our results follow more closely the Parzen
kernel estimates previously mentioned.
In Section II, we will develop the general form of the
kernel gradient estimates and derive conditions on the
kernel functions to assure asymptotically unbiased, con-
sistent, and uniformly consistent estimates. By examining
the form of the gradient estimate when a Gaussian kernel
is used, in Section III we will be able to develop a simple
mean-shift class of estimates. The approach uses the fact that the expected value of the observations within a small region about a point can be related to the density gradient at that point. A simple modification results in a k-nearest-neighbor mean-shift class of estimates. This modification is similar to Loftsgaarden and Quesenberry's [9] k-nearest-neighbor density estimates, in that the number of observations within the region is fixed, whereas in the previous estimates the volume of the region was fixed.

Manuscript received February 1, 1973; revised June 13, 1974. This work was supported in part by the National Science Foundation under Grant GJ-35722 and in part by Bell Laboratories, Madison, N.J.
K. Fukunaga is with the School of Electrical Engineering, Purdue University, Lafayette, Ind. 47907.
L. D. Hostetler was with Bell Laboratories, Whippany and Madison, N.J. He is now with the Sandia Laboratories, Albuquerque, N. Mex.
In Section IV, applications of density gradient estimation
to pattern recognition problems are investigated. By taking
a mode-seeking approach, a recursive clustering algorithm
is developed, and its properties studied. Interpretations of
our results are presented along with examples. Applications
to data filtering and intrinsic dimensionality determination are also discussed. Section V is a summary.
II. ESTIMATION OF THE DENSITY GRADIENT
In most pattern recognition problems, very little if any
information is available as to the true probability density
function or even as to its form. Due to this lack of know-
ledge about the density, we have to rely on nonparametric
techniques to obtain density gradient estimates. A straight-
forward approach for estimating a density gradient would
be to first obtain a differentiable nonparametric estimate of
the probability density function and then take its gradient.
The nonparametric density estimates we will use are
Cacoullos’ [4] multivariate extensions of Parzen’s [3]
univariate kernel estimates.
A. Proposed Gradient Estimates
Let $X_1, X_2, \cdots, X_N$ be a set of $N$ independent and identically distributed $n$-dimensional random vectors. Cacoullos [4] investigated multivariate kernel density estimators of the form

$$\hat{p}_N(X) = (Nh^n)^{-1} \sum_{j=1}^{N} k\!\left(\frac{X - X_j}{h}\right) \qquad (1)$$

where $k(X)$ is a scalar function satisfying

$$\sup_{Y \in R^n} |k(Y)| < \infty \qquad (2)$$

$$\int_{R^n} |k(Y)|\, dY < \infty \qquad (3)$$

$$\lim_{\|Y\| \to \infty} \|Y\|^n\, k(Y) = 0 \qquad (4)$$

$$\int_{R^n} k(Y)\, dY = 1 \qquad (5)$$
and where $\|\cdot\|$ is the ordinary Euclidean norm and $R^n$ is the $n$-dimensional feature space. The parameter $h$, which is a function of the sample size $N$, is chosen to satisfy

$$\lim_{N \to \infty} h(N) = 0 \qquad (6)$$

to guarantee asymptotic unbiasedness of the estimate.
Mean-square consistency of the estimate is assured by the condition

$$\lim_{N \to \infty} N h^n(N) = \infty. \qquad (7)$$

Uniform consistency (in probability) is assured by the condition

$$\lim_{N \to \infty} N h^{2n}(N) = \infty, \qquad (8)$$

provided the true density $p(X)$ is uniformly continuous. Additional conditions were found by Van Ryzin [6] to assure strong and uniformly strong consistency (with probability one).

Motivated by this general class of nonparametric density estimates, we use a differentiable kernel function and then estimate the density gradient as the gradient of the density estimate (1). This gives the density gradient estimate

$$\hat{\nabla}_x p_N(X) \equiv \nabla_x \hat{p}_N(X) = (Nh^n)^{-1} \sum_{j=1}^{N} \nabla_x k\!\left(\frac{X - X_j}{h}\right). \qquad (10)$$

B. Asymptotic Unbiasedness

The gradient estimate (10) is asymptotically unbiased for density functions that satisfy

$$\lim_{\|X\| \to \infty} p(X) = 0. \qquad (15)$$

Although the condition (15) must be imposed on the density function, practically all density functions satisfy this condition. Thus

$$\lim_{N \to \infty} E\{\hat{\nabla}_x p_N(X)\} = \nabla_x p(X) \qquad (16)$$

at the points of continuity of $\nabla_x p(X)$, provided that 1) $p(X)$ satisfies (15); 2) $h(N)$ goes to zero as $N \to \infty$ (see (6)); 3) $k(X)$ satisfies (2)-(5); and 4) $\int_{R^n} |\partial p(Y)/\partial y_i|\, dY < \infty$, for $i = 1,2,\cdots,n$. The proof is given in the Appendix.

C. Consistency

The estimate $\hat{\nabla}_x p_N(X)$ is consistent in quadratic mean,

$$\lim_{N \to \infty} E\{\|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|^2\} = 0, \qquad (24)$$

and uniformly consistent (in probability),

$$\lim_{N \to \infty} \Pr\,\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| > \varepsilon\} = 0. \qquad (25)$$
The conditions to be satisfied are as follows:

1) $\lim_{N \to \infty} h(N) = 0 \qquad (26)$

   $\lim_{N \to \infty} N h^{2n+2}(N) = \infty \qquad (27)$

2) the characteristic function

   $G(W) = \int_{R^n} \exp\,(jW^T X)\, \nabla_x k(X)\, dX \qquad (28)$

   of $\nabla_x k(X)$ is absolutely integrable, i.e.,

   $\int_{R^n} \|G(W)\|\, dW < \infty, \qquad (29)$

   where $W = [\omega_1 \cdots \omega_n]^T$;

3) $\nabla_x p(X)$ is uniformly continuous.

The proof follows from the definition of $\hat{\nabla}_x p_N(X)$ and the properties of characteristic functions, the details of which are given in the Appendix. Again we see that the additional power of two is present in the condition (27) on $h(N)$.
III. SPECIAL KERNEL FUNCTIONS
Having derived a general class of probability density
gradient estimates and shown it to have certain desirable
properties, we will now investigate specific kernel functions
in order to better understand the underlying process involved
in gradient estimation. Ultimately this results in a simple
and very intuitive mean-shift estimate for the density
gradient. A simple modification then extends this in a k-
nearest-neighbor approach to gradient estimation in much the same manner as Loftsgaarden and Quesenberry's
[9] extension of kernel density estimates.
A. Mean-Shift Gradient Estimates
The Gaussian kernel function is perhaps the best known
differentiable multivariate kernel function satisfying the
conditions for asymptotic unbiasedness, consistency, and
uniform consistency of the density gradient estimate. The
Gaussian probability density kernel function with zero
mean and identity covariance matrix is
$$k(X) = (2\pi)^{-n/2} \exp\,(-\tfrac{1}{2} X^T X). \qquad (30)$$
Taking the gradient of (30) and substituting it into (10) for $\nabla_x k(X)$, we obtain as the estimate of the density gradient

$$\hat{\nabla}_x p_N(X) = N^{-1} \sum_{i=1}^{N} (X_i - X)\,(2\pi)^{-n/2}\, h^{-(n+2)} \exp\!\left[-\,\frac{(X - X_i)^T (X - X_i)}{2h^2}\right]. \qquad (31)$$
An intuitive interpretation of (31) is that it is essentially
a weighted measure of the mean shift of the observations
about the point X. The shift Xi - X of each point from X
is calculated and multiplied by the weighting factor $(2\pi)^{-n/2}\, h^{-(n+2)} \exp\,(-(X - X_i)^T(X - X_i)/2h^2)$. The sample mean
of these weighted shifts is then taken as the gradient estimate.
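A short sketch of (31) under the same conventions as before (hypothetical names; samples is the (N, n) data matrix) makes the weighted mean-shift reading explicit:

```python
import numpy as np

def gaussian_gradient(x, samples, h):
    """Density gradient estimate (31): a weighted mean of the shifts X_i - X."""
    N, n = samples.shape
    shifts = samples - x                                   # X_i - X
    sq = np.sum((x - samples) ** 2, axis=1)                # (X - X_i)^T (X - X_i)
    w = (2.0 * np.pi) ** (-n / 2) * h ** (-(n + 2)) * np.exp(-sq / (2 * h ** 2))
    return (w[:, None] * shifts).sum(axis=0) / N           # N^{-1} sum_i w_i (X_i - X)
```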
The same general form will result if the kernel function is of
the form
$$k(X) = g(X^T X). \qquad (32)$$
The simplest kernel of this form is

$$k(X) = \begin{cases} c\,(1 - X^T X), & X^T X \le 1 \\ 0, & X^T X > 1 \end{cases} \qquad (33)$$

where

$$c = \frac{\Gamma\!\left(\frac{n+4}{2}\right)}{\pi^{n/2}} \qquad (34)$$

is the normalizing constant required to make the kernel function integrate to one and $\Gamma(\cdot)$ is the gamma function.
This kernel function can easily be shown to satisfy all the
conditions for asymptotic unbiasedness, consistency, and
uniform consistency of the gradient estimate and is, in a
sense, quite similar to the asymptotic optimum product
kernel discussed by Epanechnikov [10].
Substituting (33) into (10), we obtain as the gradient estimate

$$\hat{\nabla}_x p_N(X) = (Nh^{n+2})^{-1}\, 2c \sum_{X_i \in S_h(X)} (X_i - X) = \frac{k}{N V_h}\, \frac{n+2}{h^2} \left[\frac{1}{k} \sum_{X_i \in S_h(X)} (X_i - X)\right] \qquad (35)$$

where

$$V_h = \int_{S_h(X)} dY = \frac{h^n\, \pi^{n/2}}{\Gamma\!\left(\frac{n+2}{2}\right)} \qquad (36)$$

is the volume of the region

$$S_h(X) = \{Y \colon (Y - X)^T (Y - X) \le h^2\} \qquad (37)$$

and $k$ is the number of observations falling within $S_h(X)$ and, therefore, the number in the sum.
Equation (35) provides us with an excellent interpretation
of the gradient estimation process. The last term in (35)
is the sample mean shift

$$M_h(X) \equiv \frac{1}{k} \sum_{X_i \in S_h(X)} (X_i - X) \qquad (38)$$

of the observations in the small region $S_h(X)$ about $X$.
Clearly, if the gradient or slope is zero, corresponding
to a uniform density over the region $S_h(X)$, the average
mean shift would be zero due to the symmetry of the
observations near X. However, with a nonzero density
gradient pointing in the direction of most rapid increase
of the probability density function, on the average more
observations should fall along its direction than elsewhere
in $S_h(X)$. Correspondingly, the average mean shift should
point in that direction and have a length proportional to
the magnitude of the gradient.
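The following sketch (our own illustration, with hypothetical names) computes the sample mean shift (38); membership in $S_h(X)$ is the Euclidean-ball test of (37):

```python
import numpy as np

def mean_shift(x, samples, h):
    """Sample mean shift M_h(X) of (38) over the region S_h(X) of (37).

    Returns (M_h(X), k); M_h is the zero vector if S_h(X) is empty.
    """
    inside = np.sum((samples - x) ** 2, axis=1) <= h ** 2  # X_i in S_h(X)
    k = int(inside.sum())
    if k == 0:
        return np.zeros_like(x), 0
    return (samples[inside] - x).mean(axis=0), k
```

Multiplying $M_h(X)$ by the factor in front of the bracket in (35) recovers the full gradient estimate.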
B. Mean-Shift Normalized Gradient Estimates
Examining the proportionality constant in (35), we see that it contains a term identical to the probability density estimate using a uniform kernel function over the region $S_h(X)$,

$$\hat{p}_N(X) = \frac{k}{N V_h}. \qquad (39)$$

By taking this term to the left side of (35) and using the properties of the function $\ln y$, we see that the mean shift can be used as an estimate of the normalized gradient

$$\frac{\nabla_x p(X)}{p(X)} = \nabla_x \ln p(X) \qquad (40)$$

$$\hat{\nabla}_x \ln p_N(X) = \frac{n+2}{h^2}\, M_h(X). \qquad (41)$$
This mean-shift estimate of the normalized gradient has a
pleasingly simple and easily calculated expression (41). It
can also be given the same intuitive interpretation that
was just given in the previous section for the gradient
estimate. The normalized gradient can be used in most
applications in place of the regular gradient, and, as will
be seen in Section IV, it has desirable properties for pattern
recognition applications.
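Assuming the mean_shift helper sketched above, the normalized-gradient estimate (41) reduces to a two-line function; this is an illustration, not the authors' code:

```python
def log_density_gradient(x, samples, h):
    """Mean-shift estimate (41) of the normalized gradient grad_x ln p(X)."""
    n = samples.shape[1]
    m_h, _ = mean_shift(x, samples, h)       # M_h(X) from (38)
    return (n + 2) / h ** 2 * m_h            # (41)
```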
This mean-shift estimate can easily be generalized to a k-nearest-neighbor approach to normalized gradient estimation. Letting $h$ be replaced by the value of $d_k(X)$, the distance to the $k$th nearest neighbor of $X$, and $S_h(X)$ be replaced by

$$S_{d_k}(X) = \{Y \colon \|Y - X\| \le d_k(X)\} \qquad (42)$$

we obtain for the k-nearest-neighbor mean-shift estimate of $\nabla_x \ln p(X)$

$$\hat{\nabla}_x \ln p_N(X) = \frac{n+2}{d_k^{\,2}(X)}\, M_{d_k}(X). \qquad (43)$$
Thus we have developed both a kernel and a k-nearest-
neighbor approach to normalized gradient estimation.
What, if any, limiting properties, such as asymptotic
unbiasedness, consistency, and uniform consistency, carry
over from the kernel estimates to the k-nearest-neighbor
case is still an open research problem.
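A corresponding sketch of the k-nearest-neighbor estimate (43), again with hypothetical names, simply substitutes $d_k(X)$ for $h$:

```python
import numpy as np

def knn_log_density_gradient(x, samples, k):
    """k-nearest-neighbor mean-shift estimate (43) of grad_x ln p(X)."""
    d2 = np.sum((samples - x) ** 2, axis=1)  # squared distances to X
    idx = np.argsort(d2)[:k]                 # the k nearest neighbors of X
    d_k2 = d2[idx[-1]]                       # squared distance d_k^2(X)
    n = samples.shape[1]
    m = (samples[idx] - x).mean(axis=0)      # mean shift over S_{d_k}(X)
    return (n + 2) / d_k2 * m
```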
IV. APPLICATIONS
In this section we will show how gradient estimates can be
applied to pattern recognition problems.
A. A Gradient Clustering Algorithm
From Fig. 1, we see that one method of clustering a
set of observations into different classes would be to assign
each observation to the nearest mode along the direction
of the gradient at the observation points. To accomplish
this, one could move each observation a small step in the
direction of the gradient and iteratively repeat the process
on the transformed observations until tight clusters result
near the modes. Another approach would be to shift each
observation by some amount proportional to the gradient
at the observation point. This transformation approach
is the one we will investigate in this section since it is intuitively appealing and will be shown to have good physical motivation behind its application.

Fig. 1. Gradient mode clustering.
Letting

$$X_j^0 \equiv X_j, \qquad j = 1,2,\cdots,N \qquad (44)$$

we will transform each observation recursively according to the clustering algorithm

$$X_j^{\,l+1} = X_j^{\,l} + a\, \nabla_x \ln p(X_j^{\,l}) \qquad (45)$$

where $a$ is an appropriately chosen positive constant to guarantee convergence.
This is the $n$-dimensional analog of the linear iteration technique for finding the roots of the equation

$$\nabla_x p(X) = 0 \qquad (46)$$

or, equivalently, the modes of the mixture density $p(X)$.
Although local minima are also roots of (46), it is easy
to see that the algorithm (45) will move observations away
from these points since the gradients point away from them.
Thus after each iteration each observation will have moved
closer to its parent mode or cluster center.
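As a sketch of the recursion (44)-(45) (our own illustration, with $\nabla_x \ln p$ replaced by its mean-shift estimate (41) computed here, for simplicity, from the original observations):

```python
import numpy as np

def gradient_cluster(samples, h, a, iters=20):
    """Move each observation uphill by the rule (45) until clusters form."""
    X = samples.copy()                        # X_j^0 = X_j, as in (44)
    n = X.shape[1]
    for _ in range(iters):
        for j in range(X.shape[0]):
            inside = np.sum((samples - X[j]) ** 2, axis=1) <= h ** 2
            if inside.any():                  # estimate grad ln p via (41)
                m = (samples[inside] - X[j]).mean(axis=0)
                X[j] = X[j] + a * (n + 2) / h ** 2 * m   # the step (45)
    return X
```

Note that choosing $a = h^2/(n+2)$ makes each step move $X_j$ exactly to the sample mean of its neighbors in $S_h(X_j)$.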
The use of the normalized gradient $\nabla_x \ln p(X) = \nabla_x p(X)/p(X)$ in place of $\nabla_x p(X)$ in (45) can be justified from the following four points of view.
1) The first is that in the tails of the density and near local minima, where $p(X)$ is relatively small, it will be true that $\|\nabla_x \ln p(X)\| = \|\nabla_x p(X)\|/p(X) > \|\nabla_x p(X)\|$. Thus the corresponding step size for the same gradient will be greater than that near a mode. This will allow observations far from the mode or near a local minimum to move towards the mode faster than by using $\nabla_x p(X)$ alone.
2) The second reason for using the normalized gradient
can be seen if a Gaussian density with mean $M$ and identity covariance matrix is considered,

$$p(X) = (2\pi)^{-n/2} \exp\,[-\tfrac{1}{2}(X - M)^T (X - M)]. \qquad (47)$$

Then if $a$ is taken to be one, (45) becomes

$$X_j^1 = X_j^0 + \nabla_x \ln p(X_j^0) = X_j^0 - (X_j^0 - M) = M. \qquad (48)$$
Thus for this Gaussian density the correct choice of a
allows the clusters to condense to single points after one
iteration. This fact gives support for its use in the general
mixture density problem in which the individual mixture
component densities often look Gaussian.
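A quick numerical check of (48) (illustrative only): for a unit-covariance Gaussian, $\nabla_x \ln p(X) = -(X - M)$, so one step of (45) with $a = 1$ sends every observation to $M$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([2.0, -1.0, 0.5])
X0 = rng.standard_normal((4, 3)) + M     # draws from N(M, I)
X1 = X0 + (-(X0 - M))                    # X^1 = X^0 + grad ln p(X^0), a = 1
print(np.allclose(X1, M))                # True: clusters condense in one step
```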
3) The third reason for using the normalized gradient is
that we can show convergence in a mean-square sense. In
other words, if we make the transformation of random
variables
$$Y = X + a\, \nabla_x \ln p(X) \qquad (49)$$
then by selecting a proper a the covariance of our samples
will get smaller,
$$E\{(Y - M_Y)^T (Y - M_Y)\} < E\{(X - M_X)^T (X - M_X)\} \qquad (50)$$
representing a tightening up of the clusters, where
$$M_Y = E\{Y\} \qquad \text{and} \qquad M_X = E\{X\}. \qquad (51)$$
Letting
$$Z = \nabla_x \ln p(X) \qquad (52)$$
and
$$M_Z = E\{Z\} \qquad (53)$$
we obtain from (49)

$$E\{(Y - M_Y)^T (Y - M_Y)\} = E\{(X - M_X)^T (X - M_X)\} + 2a\, E\{(X - M_X)^T (Z - M_Z)\} + a^2 E\{(Z - M_Z)^T (Z - M_Z)\}. \qquad (54)$$
Now the $i$th component of $M_Z$ is, from (52),

$$(E\{Z\})_i = \left(\int_{R^n} \frac{\nabla_Y p(Y)}{p(Y)}\, p(Y)\, dY\right)_i = \int_{R^n} \frac{\partial p(Y)}{\partial y_i}\, dY = \int \left[\,p(Y)\,\Big|_{y_i=-\infty}^{y_i=\infty}\,\right] d\bar{Y}_i = 0 \qquad (55)$$

where $\bar{Y}_i$ denotes the remaining $n - 1$ coordinates.
The last equality holds when the density function satisfies
(15). Thus the second term of (54) becomes, upon using (55)
and integrating by parts,
$$E\{(X - M_X)^T (Z - M_Z)\} = E\{X^T Z\} = \sum_{i=1}^{n} \int_{R^n} y_i\, \frac{\partial p(Y)}{\partial y_i}\, dY = \sum_{i=1}^{n} \left(\int \left[\,y_i\, p(Y)\,\Big|_{y_i=-\infty}^{y_i=\infty}\,\right] d\bar{Y}_i - \int_{R^n} p(Y)\, dY\right) = -n \qquad (56)$$

where we have assumed

$$\lim_{\|X\| \to \infty} X p(X) = 0. \qquad (57)$$
Using (55) and (56) in (54), we obtain

$$E\{(Y - M_Y)^T (Y - M_Y)\} = E\{(X - M_X)^T (X - M_X)\} - 2an + a^2 E\{Z^T Z\}. \qquad (58)$$

Thus for convergence in a mean-square sense all that is required is that $a$ be chosen small enough; in particular, from (58),

$$0 < a < \frac{2n}{E\{Z^T Z\}}. \qquad (59)$$
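A Monte Carlo sanity check of (56) (a sketch; for the standard Gaussian, $Z = \nabla_x \ln p(X) = -X$, so $E\{X^T Z\} = -E\{X^T X\} = -n$ exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 200_000
X = rng.standard_normal((N, n))          # samples from N(0, I)
Z = -X                                   # grad_x ln p(X) for the standard Gaussian
print(np.mean(np.sum(X * Z, axis=1)))    # approximately -5 = -n, as in (56)
```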