The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition
KEINOSUKE FUKUNAGA, SENIOR MEMBER, IEEE, AND LARRY D. HOSTETLER, MEMBER, IEEE
Abstract-Nonparametric density gradient estimation using a gen-
eralized kernel approach is investigated. Conditions on the kernel func-
tions are derived to guarantee asymptotic unbiasedness, consistency, and
uniform consistency of the estimates. The results are generalized to
obtain a simple mean-shift estimate that can be extended in a k-nearest-
neighbor approach. Applications of gradient estimation to pattern
recognition are presented using clustering and intrinsic dimensionality
problems, with the ultimate goal of providing further understanding of
these problems in terms of density gradients.
I. INTRODUCTION
NONPARAMETRIC estimation of probability density functions is based on the concept that the value of a
density function at a continuity point can be estimated
using the sample observations that fall within a small
region around that point. This idea was perhaps first used
in Fix and Hodges' [1] original work. Rosenblatt [2],
Parzen [3], and Cacoullos [4] generalized these results and
developed the Parzen kernel class of density estimates.
These estimates were shown to be asymptotically unbiased,
consistent in a mean-square sense, and uniformly consistent
(in probability). Additional work by Nadaraya [5], later
extended by Van Ryzin [6] to n dimensions, showed strong
and uniformly strong consistency (with probability one).
Similarly, the gradient of a probability density function
can be estimated using the sample observations within a
small region. In this paper we will use these concepts and
results to derive a general kernel class of density gradient
estimates. Although Bhattacharyya [7] and Schuster [8]
considered estimating the density and all its derivatives in
the univariate case, we are interested in the multivariate
gradient, and our results follow more closely the Parzen
kernel estimates previously mentioned.
In Section II, we will develop the general form of the
kernel gradient estimates and derive conditions on the
kernel functions to assure asymptotically unbiased, con-
sistent, and uniformly consistent estimates. By examining
the form of the gradient estimate when a Gaussian kernel
is used, in Section III we will be able to develop a simple
mean-shift class of estimates. The approach uses the fact that the expected value of the observations within a small region about a point can be related to the density gradient at that point. A simple modification results in a k-nearest-neighbor mean-shift class of estimates. This modification is similar to Loftsgaarden and Quesenberry's [9] k-nearest-neighbor density estimates, in that the number of observations within the region is fixed, whereas in the previous estimates the volume of the region was fixed.

Manuscript received February 1, 1973; revised June 13, 1974. This work was supported in part by the National Science Foundation under Grant GJ-35722 and in part by Bell Laboratories, Madison, N.J.
K. Fukunaga is with the School of Electrical Engineering, Purdue University, Lafayette, Ind. 47907.
L. D. Hostetler was with Bell Laboratories, Whippany and Madison, N.J. He is now with the Sandia Laboratories, Albuquerque, N. Mex.
In Section IV, applications of density gradient estimation
to pattern recognition problems are investigated. By taking
a mode-seeking approach, a recursive clustering algorithm
is developed, and its properties studied. Interpretations of
our results are presented along with examples. Applications
to data filtering and intrinsic dimensionality determination are also discussed. Section V is a summary.
II. ESTIMATION OF THE DENSITY GRADIENT
In most pattern recognition problems, very little if any
information is available as to the true probability density
function or even as to its form. Due to this lack of know-
ledge about the density, we have to rely on nonparametric
techniques to obtain density gradient estimates. A straight-
forward approach for estimating a density gradient would
be to first obtain a differentiable nonparametric estimate of
the probability density function and then take its gradient.
The nonparametric density estimates we will use are
Cacoullos’ [4] multivariate extensions of Parzen’s [3]
univariate kernel estimates.
A. Proposed Gradient Estimates
Let $X_1, X_2, \cdots, X_N$ be a set of $N$ independent and identically distributed $n$-dimensional random vectors. Cacoullos [4] investigated multivariate kernel density estimators of the form

$$\hat{p}_N(X) = (Nh^n)^{-1} \sum_{j=1}^{N} k\!\left(\frac{X - X_j}{h}\right) \qquad (1)$$

where $k(X)$ is a scalar function satisfying

$$\sup_{Y \in R^n} |k(Y)| < \infty \qquad (2)$$

$$\int_{R^n} |k(Y)|\, dY < \infty \qquad (3)$$

$$\lim_{\|Y\| \to \infty} \|Y\|^n\, k(Y) = 0 \qquad (4)$$

$$\int_{R^n} k(Y)\, dY = 1 \qquad (5)$$
and where $\|\cdot\|$ is the ordinary Euclidean norm and $R^n$ is the $n$-dimensional feature space. The parameter $h$, which is a function of the sample size $N$, is chosen to satisfy

$$\lim_{N \to \infty} h(N) = 0 \qquad (6)$$

to guarantee asymptotic unbiasedness of the estimate.
Mean-square consistency of the estimate is assured by the condition

$$\lim_{N \to \infty} N h^n(N) = \infty. \qquad (7)$$

Uniform consistency (in probability) is assured by the condition

$$\lim_{N \to \infty} N h^{2n}(N) = \infty, \qquad (8)$$

provided the true density $p(X)$ is uniformly continuous. Additional conditions were found by Van Ryzin [6] to assure strong and uniformly strong consistency (with probability one).

Motivated by this general class of nonparametric density estimates, we use a differentiable kernel function and then estimate the density gradient as the gradient of the density estimate (1). This gives the density gradient estimate

$$\hat{\nabla}_x p_N(X) \equiv \nabla_x \hat{p}_N(X) = (Nh^n)^{-1} \sum_{j=1}^{N} \nabla_x k\!\left(\frac{X - X_j}{h}\right). \qquad (10)$$

B. Asymptotic Unbiasedness

The gradient estimate (10) is asymptotically unbiased for density functions that satisfy

$$\lim_{\|X\| \to \infty} p(X) = 0. \qquad (15)$$

Although the condition (15) must be imposed on the density function, practically all density functions satisfy this condition. Thus

$$\lim_{N \to \infty} E\{\hat{\nabla}_x p_N(X)\} = \nabla_x p(X) \qquad (16)$$

at the points of continuity of $\nabla_x p(X)$, provided that 1) $p(X)$ satisfies (15); 2) $h(N)$ goes to zero as $N \to \infty$ (see (6)); 3) $k(X)$ satisfies (2)-(5); and 4) $\int_{R^n} |\partial p(Y)/\partial y_i|\, dY < \infty$, for $i = 1,2,\cdots,n$. The proof is given in the Appendix.

C. Consistency

The estimate $\hat{\nabla}_x p_N(X)$ is consistent in quadratic mean,

$$\lim_{N \to \infty} E\{\|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|^2\} = 0, \qquad (24)$$

and uniformly consistent (in probability),

$$\lim_{N \to \infty} \Pr\,\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| > \varepsilon\} = 0. \qquad (25)$$
The conditions to be satisfied are as follows:

1) $\lim_{N \to \infty} h(N) = 0 \qquad (26)$

   $\lim_{N \to \infty} N h^{2n+2}(N) = \infty \qquad (27)$

2) the characteristic function

   $G(W) = \int_{R^n} \exp\,(jW^T X)\, \nabla_x k(X)\, dX \qquad (28)$

   of $\nabla_x k(X)$ is absolutely integrable, i.e.,

   $\int_{R^n} \|G(W)\|\, dW < \infty, \qquad (29)$

   where $W = [\omega_1 \cdots \omega_n]^T$;

3) $\nabla_x p(X)$ is uniformly continuous.

The proof follows from the definition of $\hat{\nabla}_x p_N(X)$ and the properties of characteristic functions, the details of which are given in the Appendix. Again we see that the additional power of two is present in the condition (27) on $h(N)$.
III. SPECIAL KERNEL FUNCTIONS
Having derived a general class of probability density
gradient estimates and shown it to have certain desirable
properties, we will now investigate specific kernel functions
in order to better understand the underlying process involved
in gradient estimation. Ultimately this results in a simple
and very intuitive mean-shift estimate for the density
gradient. A simple modification then extends this in a k-
nearest-neighbor approach to gradient estimation in much the same manner as Loftsgaarden and Quesenberry's
[9] extension of kernel density estimates.
A. Mean-Shift Gradient Estimates
The Gaussian kernel function is perhaps the best known
differentiable multivariate kernel function satisfying the
conditions for asymptotic unbiasedness, consistency, and
uniform consistency of the density gradient estimate. The
Gaussian probability density kernel function with zero
mean and identity covariance matrix is
$$k(X) = (2\pi)^{-n/2} \exp\,(-\tfrac{1}{2} X^T X). \qquad (30)$$
Taking the gradient of (30) and substituting it into (10) for $\nabla_x k(X)$, we obtain as the estimate of the density gradient

$$\hat{\nabla}_x p_N(X) = N^{-1} \sum_{i=1}^{N} (X_i - X)\,(2\pi)^{-n/2}\, h^{-(n+2)} \exp\!\left[-\,\frac{(X - X_i)^T (X - X_i)}{2h^2}\right]. \qquad (31)$$
An intuitive interpretation of (31) is that it is essentially
a weighted measure of the mean shift of the observations
about the point X. The shift Xi - X of each point from X
is calculated and multiplied by the weighting factor $(2\pi)^{-n/2}\, h^{-(n+2)} \exp\,(-(X - X_i)^T(X - X_i)/2h^2)$. The sample mean
of these weighted shifts is then taken as the gradient estimate.
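A short sketch of (31) under the same conventions as before (hypothetical names; samples is the (N, n) data matrix) makes the weighted mean-shift reading explicit:

```python
import numpy as np

def gaussian_gradient(x, samples, h):
    """Density gradient estimate (31): a weighted mean of the shifts X_i - X."""
    N, n = samples.shape
    shifts = samples - x                                   # X_i - X
    sq = np.sum((x - samples) ** 2, axis=1)                # (X - X_i)^T (X - X_i)
    w = (2.0 * np.pi) ** (-n / 2) * h ** (-(n + 2)) * np.exp(-sq / (2 * h ** 2))
    return (w[:, None] * shifts).sum(axis=0) / N           # N^{-1} sum_i w_i (X_i - X)
```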
The same general form will result if the kernel function is of
the form
$$k(X) = g(X^T X). \qquad (32)$$
The simplest kernel of this form is

$$k(X) = \begin{cases} c\,(1 - X^T X), & X^T X \le 1 \\ 0, & X^T X > 1 \end{cases} \qquad (33)$$

where

$$c = \frac{\Gamma\!\left(\frac{n+4}{2}\right)}{\pi^{n/2}} \qquad (34)$$

is the normalizing constant required to make the kernel function integrate to one and $\Gamma(\cdot)$ is the gamma function.
This kernel function can easily be shown to satisfy all the
conditions for asymptotic unbiasedness, consistency, and
uniform consistency of the gradient estimate and is, in a
sense, quite similar to the asymptotic optimum product
kernel discussed by Epanechnikov [10].
Substituting (33) into (10), we obtain as the gradient estimate

$$\hat{\nabla}_x p_N(X) = (Nh^{n+2})^{-1}\, 2c \sum_{X_i \in S_h(X)} (X_i - X) = \frac{k}{N V_h}\, \frac{n+2}{h^2} \left[\frac{1}{k} \sum_{X_i \in S_h(X)} (X_i - X)\right] \qquad (35)$$

where

$$V_h = \int_{S_h(X)} dY = \frac{h^n\, \pi^{n/2}}{\Gamma\!\left(\frac{n+2}{2}\right)} \qquad (36)$$

is the volume of the region

$$S_h(X) = \{Y \colon (Y - X)^T (Y - X) \le h^2\} \qquad (37)$$

and $k$ is the number of observations falling within $S_h(X)$ and, therefore, the number in the sum.
Equation (35) provides us with an excellent interpretation
of the gradient estimation process. The last term in (35)
is the sample mean shift

$$M_h(X) \equiv \frac{1}{k} \sum_{X_i \in S_h(X)} (X_i - X) \qquad (38)$$

of the observations in the small region $S_h(X)$ about $X$.
Clearly, if the gradient or slope is zero, corresponding
to a uniform density over the region $S_h(X)$, the average
mean shift would be zero due to the symmetry of the
observations near X. However, with a nonzero density
gradient pointing in the direction of most rapid increase
of the probability density function, on the average more
observations should fall along its direction than elsewhere
in $S_h(X)$. Correspondingly, the average mean shift should
point in that direction and have a length proportional to
the magnitude of the gradient.
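The following sketch (our own illustration, with hypothetical names) computes the sample mean shift (38); membership in $S_h(X)$ is the Euclidean-ball test of (37):

```python
import numpy as np

def mean_shift(x, samples, h):
    """Sample mean shift M_h(X) of (38) over the region S_h(X) of (37).

    Returns (M_h(X), k); M_h is the zero vector if S_h(X) is empty.
    """
    inside = np.sum((samples - x) ** 2, axis=1) <= h ** 2  # X_i in S_h(X)
    k = int(inside.sum())
    if k == 0:
        return np.zeros_like(x), 0
    return (samples[inside] - x).mean(axis=0), k
```

Multiplying $M_h(X)$ by the factor in front of the bracket in (35) recovers the full gradient estimate.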
B. Mean-Shift Normalized Gradient Estimates
Examining the proportionality constant in (35), we see that it contains a term identical to the probability density estimate using a uniform kernel function over the region $S_h(X)$,

$$\hat{p}_N(X) = \frac{k}{N V_h}. \qquad (39)$$

By taking this term to the left side of (35) and using the properties of the function $\ln y$, we see that the mean shift can be used as an estimate of the normalized gradient

$$\frac{\nabla_x p(X)}{p(X)} = \nabla_x \ln p(X) \qquad (40)$$

$$\hat{\nabla}_x \ln p_N(X) = \frac{n+2}{h^2}\, M_h(X). \qquad (41)$$
This mean-shift estimate of the normalized gradient has a
pleasingly simple and easily calculated expression (41). It
can also be given the same intuitive interpretation that
was just given in the previous section for the gradient
estimate. The normalized gradient can be used in most
applications in place of the regular gradient, and, as will
be seen in Section IV, it has desirable properties for pattern
recognition applications.
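Assuming the mean_shift helper sketched above, the normalized-gradient estimate (41) reduces to a two-line function; this is an illustration, not the authors' code:

```python
def log_density_gradient(x, samples, h):
    """Mean-shift estimate (41) of the normalized gradient grad_x ln p(X)."""
    n = samples.shape[1]
    m_h, _ = mean_shift(x, samples, h)       # M_h(X) from (38)
    return (n + 2) / h ** 2 * m_h            # (41)
```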
This mean-shift estimate can easily be generalized to a k-nearest-neighbor approach to normalized gradient estimation. Letting $h$ be replaced by the value of $d_k(X)$, the distance to the $k$th nearest neighbor of $X$, and $S_h(X)$ be replaced by

$$S_{d_k}(X) = \{Y \colon \|Y - X\| \le d_k(X)\} \qquad (42)$$

we obtain for the k-nearest-neighbor mean-shift estimate of $\nabla_x \ln p(X)$

$$\hat{\nabla}_x \ln p_N(X) = \frac{n+2}{d_k^{\,2}(X)}\, M_{d_k}(X). \qquad (43)$$
Thus we have developed both a kernel and a k-nearest-
neighbor approach to normalized gradient estimation.
What, if any, limiting properties, such as asymptotic
unbiasedness, consistency, and uniform consistency, carry
over from the kernel estimates to the k-nearest-neighbor
case is still an open research problem.
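A corresponding sketch of the k-nearest-neighbor estimate (43), again with hypothetical names, simply substitutes $d_k(X)$ for $h$:

```python
import numpy as np

def knn_log_density_gradient(x, samples, k):
    """k-nearest-neighbor mean-shift estimate (43) of grad_x ln p(X)."""
    d2 = np.sum((samples - x) ** 2, axis=1)  # squared distances to X
    idx = np.argsort(d2)[:k]                 # the k nearest neighbors of X
    d_k2 = d2[idx[-1]]                       # squared distance d_k^2(X)
    n = samples.shape[1]
    m = (samples[idx] - x).mean(axis=0)      # mean shift over S_{d_k}(X)
    return (n + 2) / d_k2 * m
```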
IV. APPLICATIONS
In this section we will show how gradient estimates can be
applied to pattern recognition problems.
A. A Gradient Clustering Algorithm
From Fig. 1, we see that one method of clustering a
set of observations into different classes would be to assign
each observation to the nearest mode along the direction
of the gradient at the observation points. To accomplish
this, one could move each observation a small step in the
direction of the gradient and iteratively repeat the process
on the transformed observations until tight clusters result
near the modes. Another approach would be to shift each
observation by some amount proportional to the gradient
at the observation point. This transformation approach
is the one we will investigate in this section since it is intuitively appealing and will be shown to have good physical motivation behind its application.

Fig. 1. Gradient mode clustering.
Letting

$$X_j^0 \equiv X_j, \qquad j = 1,2,\cdots,N \qquad (44)$$

we will transform each observation recursively according to the clustering algorithm

$$X_j^{\,l+1} = X_j^{\,l} + a\, \nabla_x \ln p(X_j^{\,l}) \qquad (45)$$

where $a$ is an appropriately chosen positive constant to guarantee convergence.
This is the $n$-dimensional analog of the linear iteration technique for finding the roots of the equation

$$\nabla_x p(X) = 0 \qquad (46)$$

or, equivalently, the modes of the mixture density $p(X)$.
Although local minima are also roots of (46), it is easy
to see that the algorithm (45) will move observations away
from these points since the gradients point away from them.
Thus after each iteration each observation will have moved
closer to its parent mode or cluster center.
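As a sketch of the recursion (44)-(45) (our own illustration, with $\nabla_x \ln p$ replaced by its mean-shift estimate (41) computed here, for simplicity, from the original observations):

```python
import numpy as np

def gradient_cluster(samples, h, a, iters=20):
    """Move each observation uphill by the rule (45) until clusters form."""
    X = samples.copy()                        # X_j^0 = X_j, as in (44)
    n = X.shape[1]
    for _ in range(iters):
        for j in range(X.shape[0]):
            inside = np.sum((samples - X[j]) ** 2, axis=1) <= h ** 2
            if inside.any():                  # estimate grad ln p via (41)
                m = (samples[inside] - X[j]).mean(axis=0)
                X[j] = X[j] + a * (n + 2) / h ** 2 * m   # the step (45)
    return X
```

Note that choosing $a = h^2/(n+2)$ makes each step move $X_j$ exactly to the sample mean of its neighbors in $S_h(X_j)$.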
The use of the normalized gradient $\nabla_x \ln p(X) = \nabla_x p(X)/p(X)$ in place of $\nabla_x p(X)$ in (45) can be justified from the following four points of view.
1) The first is that in the tails of the density and near local minima, where $p(X)$ is relatively small, it will be true that $\|\nabla_x \ln p(X)\| = \|\nabla_x p(X)\|/p(X) > \|\nabla_x p(X)\|$. Thus the corresponding step size for the same gradient will be greater than that near a mode. This will allow observations far from the mode or near a local minimum to move towards the mode faster than by using $\nabla_x p(X)$ alone.
2) The second reason for using the normalized gradient
can be seen if a Gaussian density with mean $M$ and identity covariance matrix is considered,

$$p(X) = (2\pi)^{-n/2} \exp\,[-\tfrac{1}{2}(X - M)^T (X - M)]. \qquad (47)$$

Then if $a$ is taken to be one, (45) becomes

$$X_j^1 = X_j^0 + \nabla_x \ln p(X_j^0) = X_j^0 - (X_j^0 - M) = M. \qquad (48)$$
Thus for this Gaussian density the correct choice of a
allows the clusters to condense to single points after one
iteration. This fact gives support for its use in the general
mixture density problem in which the individual mixture
component densities often look Gaussian.
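A quick numerical check of (48) (illustrative only): for a unit-covariance Gaussian, $\nabla_x \ln p(X) = -(X - M)$, so one step of (45) with $a = 1$ sends every observation to $M$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([2.0, -1.0, 0.5])
X0 = rng.standard_normal((4, 3)) + M     # draws from N(M, I)
X1 = X0 + (-(X0 - M))                    # X^1 = X^0 + grad ln p(X^0), a = 1
print(np.allclose(X1, M))                # True: clusters condense in one step
```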
3) The third reason for using the normalized gradient is
that we can show convergence in a mean-square sense. In
other words, if we make the transformation of random
variables
$$Y = X + a\, \nabla_x \ln p(X) \qquad (49)$$
then by selecting a proper a the covariance of our samples
will get smaller,
$$E\{(Y - M_Y)^T (Y - M_Y)\} < E\{(X - M_X)^T (X - M_X)\} \qquad (50)$$
representing a tightening up of the clusters, where
$$M_Y = E\{Y\} \qquad \text{and} \qquad M_X = E\{X\}. \qquad (51)$$
Letting
$$Z = \nabla_x \ln p(X) \qquad (52)$$
and
$$M_Z = E\{Z\} \qquad (53)$$
we obtain from (49)

$$E\{(Y - M_Y)^T (Y - M_Y)\} = E\{(X - M_X)^T (X - M_X)\} + 2a\, E\{(X - M_X)^T (Z - M_Z)\} + a^2 E\{(Z - M_Z)^T (Z - M_Z)\}. \qquad (54)$$
Now the $i$th component of $M_Z$ is, from (52),

$$(E\{Z\})_i = \left(\int_{R^n} \frac{\nabla_Y p(Y)}{p(Y)}\, p(Y)\, dY\right)_i = \int_{R^n} \frac{\partial p(Y)}{\partial y_i}\, dY = \int \left[\,p(Y)\,\Big|_{y_i=-\infty}^{y_i=\infty}\,\right] d\bar{Y}_i = 0 \qquad (55)$$

where $\bar{Y}_i$ denotes the remaining $n - 1$ coordinates.
The last equality holds when the density function satisfies
(15). Thus the second term of (54) becomes, upon using (55)
and integrating by parts,
$$E\{(X - M_X)^T (Z - M_Z)\} = E\{X^T Z\} = \sum_{i=1}^{n} \int_{R^n} y_i\, \frac{\partial p(Y)}{\partial y_i}\, dY = \sum_{i=1}^{n} \left(\int \left[\,y_i\, p(Y)\,\Big|_{y_i=-\infty}^{y_i=\infty}\,\right] d\bar{Y}_i - \int_{R^n} p(Y)\, dY\right) = -n \qquad (56)$$

where we have assumed

$$\lim_{\|X\| \to \infty} X p(X) = 0. \qquad (57)$$
Using (55) and (56) in (54), we obtain

$$E\{(Y - M_Y)^T (Y - M_Y)\} = E\{(X - M_X)^T (X - M_X)\} - 2an + a^2 E\{Z^T Z\}. \qquad (58)$$

Thus for convergence in a mean-square sense all that is required is that $a$ be chosen small enough; in particular, from (58),

$$0 < a < \frac{2n}{E\{Z^T Z\}}. \qquad (59)$$
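A Monte Carlo sanity check of (56) (a sketch; for the standard Gaussian, $Z = \nabla_x \ln p(X) = -X$, so $E\{X^T Z\} = -E\{X^T X\} = -n$ exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 200_000
X = rng.standard_normal((N, n))          # samples from N(0, I)
Z = -X                                   # grad_x ln p(X) for the standard Gaussian
print(np.mean(np.sum(X * Z, axis=1)))    # approximately -5 = -n, as in (56)
```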