3 Nonparametric Regression
3.1 Nadaraya-Watson Regression
Let the data be $(y_i, X_i)$ where $y_i$ is real-valued and $X_i$ is a $q$-vector, and assume that all are continuously distributed with joint density $f(y,x)$. Let $f(y \mid x) = f(y,x)/f(x)$ be the conditional density of $y_i$ given $X_i$, where $f(x) = \int f(y,x)\,dy$ is the marginal density of $X_i$. The regression function for $y_i$ on $X_i$ is
$$g(x) = E\left(y_i \mid X_i = x\right).$$
We want to estimate this nonparametrically, with minimal assumptions about $g$.

If we had a large number of observations where $X_i$ exactly equals $x$, we could take the average value of the $y_i$'s for these observations. But since $X_i$ is continuously distributed, we will not observe multiple observations taking the same value.

The solution is to consider a neighborhood of $x$, and note that if $X_i$ has a positive density at $x$, we should observe a number of observations in this neighborhood, and this number increases with the sample size. If the regression function $g(x)$ is continuous, it should be reasonably constant over this neighborhood (if it is small enough), so we can take the average of the $y_i$ values for these observations. The trick is to choose the size of the neighborhood to trade off the variation in $g(x)$ over the neighborhood (estimation bias) against the number of observations in the neighborhood (estimation variance).
Take the one-regressor case $q = 1$.

Let a neighborhood of $x$ be $x \pm h$ for some bandwidth $h > 0$. Then a simple nonparametric estimator of $g(x)$ is the average value of the $y_i$'s for the observations $i$ such that $X_i$ is in this neighborhood, that is,
$$\hat g(x) = \frac{\sum_{i=1}^n 1\left(|X_i - x| \le h\right) y_i}{\sum_{i=1}^n 1\left(|X_i - x| \le h\right)} = \frac{\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) y_i}{\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)}$$
where $k(u)$ is the uniform kernel.
In general, the kernel regression estimator takes this form, where $k(u)$ is a kernel function. It is known as the Nadaraya-Watson estimator, or local constant estimator.
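To make the estimator concrete, here is a minimal sketch of the univariate Nadaraya-Watson estimator in Python (NumPy). The Gaussian kernel, the function names, and the test data are my own illustrative choices, not from the notes.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, a common second-order kernel k(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nw_estimator(x, X, y, h, kernel=gaussian_kernel):
    """Nadaraya-Watson (local constant) estimate of g(x) = E(y | X = x).

    Computes sum_i k((X_i - x)/h) y_i / sum_i k((X_i - x)/h).
    """
    w = kernel((X - x) / h)
    return np.sum(w * y) / np.sum(w)

# Illustrative check: g(x) = x^2 with uniform regressors and small noise
rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(-1, 1, n)
y = X**2 + 0.1 * rng.standard_normal(n)
print(nw_estimator(0.0, X, y, h=0.1))  # close to g(0) = 0
```

With a uniform marginal density the estimator is close to the truth; Section 3.5 below shows this can fail badly when the regressors are unevenly spaced.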
When $q > 1$ the estimator is
$$\hat g(x) = \frac{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) y_i}{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)}$$
where $K(u)$ is a multivariate kernel function.
As an alternative motivation, note that the regression function can be written as
$$g(x) = \frac{\int y f(y,x)\,dy}{f(x)}$$
where $f(x) = \int f(y,x)\,dy$ is the marginal density of $X_i$. Now consider estimating $g$ by replacing the density functions with the nonparametric estimates we have already studied. That is,
$$\hat f(y,x) = \frac{1}{n|H| h_y} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) k\left(\frac{y_i - y}{h_y}\right)$$
where $h_y$ is a bandwidth for smoothing in the $y$-direction. Then
$$\hat f(x) = \int \hat f(y,x)\,dy = \frac{1}{n|H| h_y} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) \int k\left(\frac{y_i - y}{h_y}\right) dy = \frac{1}{n|H|} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)$$
and
$$\int y \hat f(y,x)\,dy = \frac{1}{n|H| h_y} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) \int y\, k\left(\frac{y_i - y}{h_y}\right) dy = \frac{1}{n|H|} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) y_i$$
and thus taking the ratio
$$\hat g(x) = \frac{\frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) y_i}{\frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)} = \frac{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) y_i}{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)},$$
again obtaining the Nadaraya-Watson estimator. Note that the bandwidth $h_y$ has disappeared.
The estimator is ill-defined for values of $x$ such that $\hat f(x) \le 0$. This can occur in the tails of the distribution of $X_i$. As higher-order kernels can yield $\hat f(x) < 0$, many authors suggest using only second-order kernels for regression. I am unsure if this is a correct recommendation. If a higher-order kernel is used and for some $x$ we find $\hat f(x) < 0$, this suggests that the data are so sparse in that neighborhood of $x$ that it is unreasonable to estimate the regression function there. It does not require the abandonment of higher-order kernels. We will follow convention and typically assume that $k$ is second order ($\nu = 2$) for our presentation.
3.2 Asymptotic Distribution
We analyze the asymptotic distribution of the NW estimator $\hat g(x)$ for the case $q = 1$.

Since $E(y_i \mid X_i) = g(X_i)$, we can write the regression equation as $y_i = g(X_i) + e_i$ where $E(e_i \mid X_i) = 0$. We can also write the conditional variance as $E\left(e_i^2 \mid X_i = x\right) = \sigma^2(x)$.
Fix $x$. Note that
$$y_i = g(X_i) + e_i = g(x) + \left(g(X_i) - g(x)\right) + e_i$$
and therefore
$$\frac{1}{nh} \sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) y_i = \frac{1}{nh} \sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) g(x) + \frac{1}{nh} \sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) \left(g(X_i) - g(x)\right) + \frac{1}{nh} \sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) e_i = \hat f(x) g(x) + \hat m_1(x) + \hat m_2(x),$$
say. It follows that
$$\hat g(x) = g(x) + \frac{\hat m_1(x)}{\hat f(x)} + \frac{\hat m_2(x)}{\hat f(x)}.$$
We now analyze the asymptotic distributions of the components $\hat m_1(x)$ and $\hat m_2(x)$.

First take $\hat m_2(x)$. Since $E(e_i \mid X_i) = 0$ it follows that $E\left(k\left(\frac{X_i - x}{h}\right) e_i\right) = 0$ and thus $E(\hat m_2(x)) = 0$. Its variance is
$$\mathrm{var}\left(\hat m_2(x)\right) = \frac{1}{nh^2} E\left(k\left(\frac{X_i - x}{h}\right) e_i\right)^2 = \frac{1}{nh^2} E\left(k\left(\frac{X_i - x}{h}\right)^2 \sigma^2(X_i)\right)$$
(by conditioning), and this is
$$\frac{1}{nh^2} \int k\left(\frac{z - x}{h}\right)^2 \sigma^2(z) f(z)\,dz$$
(where $f(z)$ is the density of $X_i$). Making the change of variables $z = x + hu$, this equals
$$\frac{1}{nh} \int k(u)^2 \sigma^2(x + hu) f(x + hu)\,du = \frac{1}{nh} \int k(u)^2 \sigma^2(x) f(x)\,du + o\left(\frac{1}{nh}\right) = \frac{R(k)\sigma^2(x) f(x)}{nh} + o\left(\frac{1}{nh}\right)$$
if $\sigma^2(x)$ and $f(x)$ are smooth in $x$. We can even apply the CLT to obtain that as $h \to 0$ and $nh \to \infty$,
$$\sqrt{nh}\,\hat m_2(x) \to_d N\left(0,\; R(k)\sigma^2(x) f(x)\right).$$
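This variance result can be checked by simulation. The sketch below is my own illustration, under the assumptions: uniform kernel $k(u) = \frac12 1(|u| \le 1)$ with $R(k) = 1/2$, $X_i \sim N(0,1)$, homoskedastic errors with $\sigma^2(x) = 1$, and $x = 0$. It compares the Monte Carlo variance of $\sqrt{nh}\,\hat m_2(0)$ with the asymptotic value $R(k)\sigma^2(0)f(0)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def m2_hat(X, e, x, h):
    """m2(x) = (1/(nh)) sum_i k((X_i - x)/h) e_i with the uniform kernel."""
    k = 0.5 * (np.abs((X - x) / h) <= 1)
    return np.mean(k * e) / h

n, h, reps = 2000, 0.2, 3000
draws = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal(n)
    e = rng.standard_normal(n)          # sigma^2(x) = 1
    draws[r] = np.sqrt(n * h) * m2_hat(X, e, 0.0, h)

# Asymptotic variance: R(k) * sigma^2(0) * f(0), with R(k) = 1/2, f(0) = 1/sqrt(2 pi)
asym = 0.5 * 1.0 / np.sqrt(2 * np.pi)
print(draws.var(), asym)   # the two should be close
```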
Now take $\hat m_1(x)$. Its mean is
$$E \hat m_1(x) = \frac{1}{h} E\, k\left(\frac{X_i - x}{h}\right)\left(g(X_i) - g(x)\right) = \frac{1}{h} \int k\left(\frac{z - x}{h}\right)\left(g(z) - g(x)\right) f(z)\,dz = \int k(u)\left(g(x + hu) - g(x)\right) f(x + hu)\,du.$$
Now expanding both $g$ and $f$ in Taylor expansions, this equals, up to $o(h^2)$,
$$\int k(u)\left(uh\, g^{(1)}(x) + \frac{u^2 h^2}{2} g^{(2)}(x)\right)\left(f(x) + uh\, f^{(1)}(x)\right) du = \left(\int k(u)\, u\,du\right) h\, g^{(1)}(x) f(x) + \left(\int k(u)\, u^2\,du\right) h^2 \left(\frac{1}{2} g^{(2)}(x) f(x) + g^{(1)}(x) f^{(1)}(x)\right) = h^2 \mu_2 B(x) f(x),$$
where $\mu_2 = \int k(u)\, u^2\,du$, the first term vanishes since $\int k(u)\, u\,du = 0$ for a second-order kernel, and
$$B(x) = \frac{1}{2} g^{(2)}(x) + f(x)^{-1} g^{(1)}(x) f^{(1)}(x).$$
(If $k$ is a higher-order kernel, this is $O(h^\nu)$ instead.) A similar expansion shows that $\mathrm{var}(\hat m_1(x)) = O\left(\frac{h^2}{nh}\right)$, which is of smaller order than $O\left(\frac{1}{nh}\right)$. Thus
$$\sqrt{nh}\left(\hat m_1(x) - h^2 \mu_2 B(x) f(x)\right) \to_p 0$$
and since $\hat f(x) \to_p f(x)$,
$$\sqrt{nh}\left(\frac{\hat m_1(x)}{\hat f(x)} - h^2 \mu_2 B(x)\right) \to_p 0.$$
In summary, we have
$$\sqrt{nh}\left(\hat g(x) - g(x) - h^2 \mu_2 B(x)\right) = \sqrt{nh}\left(\frac{\hat m_1(x)}{\hat f(x)} - h^2 \mu_2 B(x)\right) + \frac{\sqrt{nh}\,\hat m_2(x)}{\hat f(x)} \to_d \frac{N\left(0,\; R(k)\sigma^2(x) f(x)\right)}{f(x)} = N\left(0,\; \frac{R(k)\sigma^2(x)}{f(x)}\right).$$
When $X_i$ is a $q$-vector, the result is
$$\sqrt{n|H|}\left(\hat g(x) - g(x) - \mu_2 \sum_{j=1}^q h_j^2 B_j(x)\right) \to_d N\left(0,\; \frac{R(k)^q \sigma^2(x)}{f(x)}\right)$$
where
$$B_j(x) = \frac{1}{2} \frac{\partial^2}{\partial x_j^2} g(x) + f(x)^{-1} \frac{\partial}{\partial x_j} g(x) \frac{\partial}{\partial x_j} f(x).$$
3.3 Mean Squared Error
The AMSE of the NW estimator $\hat g(x)$ is
$$\mathrm{AMSE}\left(\hat g(x)\right) = \mu_2^2 \left(\sum_{j=1}^q h_j^2 B_j(x)\right)^2 + \frac{R(k)^q \sigma^2(x)}{n|H| f(x)}.$$
A weighted integrated MSE takes the form
$$\mathrm{WIMSE} = \int \mathrm{AMSE}\left(\hat g(x)\right) f(x) M(x)\,dx = \mu_2^2 \int \left(\sum_{j=1}^q h_j^2 B_j(x)\right)^2 f(x) M(x)\,dx + \frac{R(k)^q \int \sigma^2(x) M(x)\,dx}{n h_1 h_2 \cdots h_q}$$
where $M(x)$ is a weight function. Possible choices include $M(x) = f(x)$ and $M(x) = 1\left(f(x) \ge \varepsilon\right)$ for some $\varepsilon > 0$. The AMSE needs the weighting; otherwise the integral will not exist.
3.4 Observations about the Asymptotic Distribution
In univariate regression, the optimal rate for the bandwidth is $h_0 = C n^{-1/5}$, with mean-squared convergence $O(n^{-2/5})$. In the multiple regressor case, the optimal bandwidths are $h_j = C n^{-1/(q+4)}$ with convergence rate $O\left(n^{-2/(q+4)}\right)$. This is the same as for univariate and $q$-variate density estimation.

If higher-order kernels are used, the optimal bandwidth and convergence rates are again the same as for density estimation.

The asymptotic distribution depends on the kernel through $R(k)$ and $\mu_2$. The optimal kernel minimizes $R(k)$, the same as for density estimation. Thus the Epanechnikov family is optimal for regression.

As the WIMSE depends on the first and second derivatives of the mean function $g(x)$, the optimal bandwidth will depend on these values. When the derivative functions $B_j(x)$ are larger, the optimal bandwidths are smaller, to capture the fluctuations in the function $g(x)$. When the derivatives are smaller, the optimal bandwidths are larger, smoothing more and thus reducing the estimation variance.

For nonparametric regression, reference bandwidths are not natural. This is because there is no natural reference $g(x)$ which dictates the first and second derivatives. Many authors use the rule-of-thumb bandwidth for density estimation (for the regressors $X_i$), but there is absolutely no justification for this choice. The theory shows that the optimal bandwidth depends on the curvature of the conditional mean $g(x)$, and this is independent of the marginal density $f(x)$ for which the rule-of-thumb is designed.
3.5 Limitations of the NW estimator
Suppose that $q = 1$ and the true conditional mean is linear: $g(x) = \alpha + x\beta$. As this is a very simple situation, we might expect that a nonparametric estimator will work reasonably well. This is not necessarily the case with the NW estimator.

Take the absolutely simplest case, that there is no regression error, i.e. $y_i = \alpha + X_i \beta$ identically. A simple scatter plot would reveal the deterministic relationship. How will NW perform?

The answer depends on the marginal distribution of the $X_i$. If they are not spaced at uniform distances, then $\hat g(x) \ne g(x)$. The NW estimator applied to purely linear data yields a nonlinear output!

One way to see the source of the problem is to consider the problem of nonparametrically estimating $E\left(X_i - x \mid X_i = x\right) = 0$. The numerator of the NW estimator of this expectation is
$$\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)\left(X_i - x\right)$$
but this is (generally) non-zero.
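A quick numeric illustration of this point (my own example, not from the notes): generate deterministic linear data $y_i = 1 + 2X_i$ with exponentially distributed regressors, so the $X_i$ are unevenly spaced with density decreasing in $x$, and evaluate the NW estimator at $x = 1$. Both the sum above and the NW bias come out visibly non-zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h = 10000, 0.5
X = rng.exponential(1.0, n)       # non-uniform spacing: density decreasing in x
y = 1.0 + 2.0 * X                 # exactly linear, no error

def gauss(u):
    return np.exp(-0.5 * u**2)

x = 1.0
k = gauss((X - x) / h)
numerator = np.sum(k * (X - x))   # NW numerator for E(X_i - x | X_i = x); should be 0, but isn't
g_hat = np.sum(k * y) / np.sum(k)

print(numerator)                  # negative: weights pull toward the denser region (small x)
print(g_hat, 1.0 + 2.0 * x)       # NW estimate is well below the true value g(1) = 3
```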
Can the problem be resolved by choice of bandwidth? Actually, it can make things worse. As the bandwidth increases (to increase smoothing), $\hat g(x)$ collapses to a flat function. Recall that the NW estimator is also called the local constant estimator: it approximates the regression function by a (local) constant. As smoothing increases, the estimator simplifies to a constant, not to a linear function.

Another limitation of the NW estimator occurs at the edges of the support. Again consider the case $q = 1$. For a value of $x \le \min(X_i)$, the NW estimator $\hat g(x)$ is an average only of $y_i$ values for observations to the right of $x$. If $g(x)$ is positively sloped, the NW estimator will be upward biased. In fact, the estimator is inconsistent at the boundary. This effectively restricts application of the NW estimator to values of $x$ in the interior of the support of the regressors, and this may be too limiting.
3.6 Local Linear Estimator
We started this chapter by motivating the NW estimator at $x$ by taking an average of the $y_i$ values for observations such that $X_i$ is in a neighborhood of $x$. This is a local constant approximation. Instead, we could fit a linear regression line through the observations in the same neighborhood. If we use a weighting function, this is called the local linear (LL) estimator, and it is quite popular in the recent nonparametric regression literature.

The idea is to fit the local model
$$y_i = \alpha + \beta'\left(X_i - x\right) + e_i.$$
The reason for using the regressor $X_i - x$ rather than $X_i$ is so that the intercept equals $g(x) = E\left(y_i \mid X_i = x\right)$. Once we obtain the estimates $\hat\alpha(x)$ and $\hat\beta(x)$, we set $\hat g(x) = \hat\alpha(x)$. Furthermore, we can use $\hat\beta(x)$ to estimate $\frac{\partial}{\partial x} g(x)$.
If we simply fit a linear regression through observations such that $|X_i - x| \le h$, this can be written as
$$\min_{\alpha, \beta} \sum_{i=1}^n \left(y_i - \alpha - \beta'\left(X_i - x\right)\right)^2 1\left(|X_i - x| \le h\right)$$
or, setting
$$Z_i = \begin{pmatrix} 1 \\ X_i - x \end{pmatrix},$$
we have the explicit expression
$$\begin{pmatrix} \hat\alpha(x) \\ \hat\beta(x) \end{pmatrix} = \left(\sum_{i=1}^n 1\left(|X_i - x| \le h\right) Z_i Z_i'\right)^{-1} \left(\sum_{i=1}^n 1\left(|X_i - x| \le h\right) Z_i y_i\right) = \left(\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) Z_i Z_i'\right)^{-1} \left(\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) Z_i y_i\right)$$
where the second expression is valid for any (multivariate) kernel function. This is a (locally) weighted regression of $y_i$ on $X_i$. Algebraically, it equals a WLS estimator.
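The weighted least-squares formula above translates directly into code. This is a minimal sketch for a univariate regressor (my own code; Gaussian kernel assumed), returning both $\hat g(x) = \hat\alpha(x)$ and the local slope $\hat\beta(x)$.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def local_linear(x, X, y, h, kernel=gaussian_kernel):
    """Local linear estimate at x: weighted LS of y on Z_i = (1, X_i - x)'.

    Returns (alpha_hat, beta_hat) = (sum_i K_i Z_i Z_i')^{-1} (sum_i K_i Z_i y_i).
    """
    w = kernel((X - x) / h)
    Z = np.column_stack([np.ones_like(X), X - x])
    A = (Z * w[:, None]).T @ Z          # sum_i K_i Z_i Z_i'
    b = (Z * w[:, None]).T @ y          # sum_i K_i Z_i y_i
    alpha_hat, beta_hat = np.linalg.solve(A, b)
    return alpha_hat, beta_hat

# LL preserves linear data exactly: y = 2 + 3X is reproduced at any x,
# even with the unevenly spaced regressors that break the NW estimator
rng = np.random.default_rng(3)
X = rng.exponential(1.0, 500)
y = 2.0 + 3.0 * X
g_hat, slope = local_linear(1.0, X, y, h=0.5)
print(g_hat, slope)   # 5.0 and 3.0 up to rounding error
```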
In contrast to the NW estimator, the LL estimator preserves linear data. That is, if the true data lie on a line $y_i = \alpha + X_i'\beta$, then for any sub-sample, a local linear regression fits exactly, so $\hat g(x) = g(x)$. In fact, we will see that the distribution of the LL estimator is invariant to the first derivative of $g$: it has zero bias when the true regression is linear.

As $h \to \infty$ (smoothing is increased), the LL estimator collapses to the OLS regression of $y_i$ on $X_i$. In this sense LL is a natural nonparametric generalization of least-squares regression.

The LL estimator also has much better properties at the boundary than the NW estimator. Intuitively, even if $x$ is at the boundary of the regression support, the local linear estimator fits a (weighted) least-squares line through data near the boundary, so if the true relationship is linear this estimator will be unbiased.
Deriving the asymptotic distribution of the LL estimator is similar to that for the NW estimator, but much more involved, so I will not present the argument here. It has the following asymptotic distribution. Let $\hat g(x) = \hat\alpha(x)$. Then
$$\sqrt{n|H|}\left(\hat g(x) - g(x) - \mu_2 \sum_{j=1}^q h_j^2 \frac{1}{2} \frac{\partial^2}{\partial x_j^2} g(x)\right) \to_d N\left(0,\; \frac{R(k)^q \sigma^2(x)}{f(x)}\right).$$
This is quite similar to the distribution for the NW estimator, with one important difference: the bias term has been simplified. The term involving $f(x)^{-1} \frac{\partial}{\partial x_j} g(x) \frac{\partial}{\partial x_j} f(x)$ has been eliminated. The asymptotic variance is unchanged.
Strictly speaking, we cannot rank the AMSE of the NW versus the LL estimator. While a bias term has been eliminated, it is possible that the two terms have opposite signs and thereby cancel somewhat. However, the standard intuition is that a simplified bias term suggests reduced bias in practice. The AMSE of the LL estimator depends only on the second derivative of $g(x)$, while that of the NW estimator also depends on the first derivative. We expect this to translate into reduced bias.

Magically, this does not come at a cost in the asymptotic variance. These facts have led the statistics literature to focus on the LL estimator as the preferred approach.

While I agree with this general view, a side note of caution is warranted. Simple simulation experiments show that the LL estimator does not always beat the NW estimator. When the regression function $g(x)$ is quite flat, the NW estimator does better. When the regression function is steeper and curvier, the LL estimator tends to do better. The explanation is that while the two have identical asymptotic variance formulae, in finite samples the NW estimator tends to have a smaller variance. This gives it an advantage in contexts where estimation bias is low (such as when the regression function is flat). The reason I mention this is that in many economic contexts, it is believed that the regression function may be quite flat with respect to many regressors. In this context it may be better to use NW rather than LL.
3.7 Local Polynomial Estimation
If LL improves on NW, why not local polynomial? The intuition is quite straightforward. Rather than fitting a local linear equation, we can fit a local quadratic, cubic, or polynomial of arbitrary order.

Let $p$ denote the order of the local polynomial. Thus $p = 0$ is the NW estimator, $p = 1$ is the LL estimator, and $p = 2$ is a local quadratic.
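The local polynomial estimator is the same weighted least-squares idea with regressors $(1, (X_i - x), \ldots, (X_i - x)^p)$. A sketch for a univariate regressor (my own code; Gaussian weights assumed):

```python
import numpy as np

def local_polynomial(x, X, y, h, p):
    """Local polynomial estimate of g(x): weighted LS of y on powers of (X_i - x).

    p = 0 gives NW, p = 1 local linear, p = 2 local quadratic.
    The intercept estimates g(x); the j-th coefficient estimates g^(j)(x)/j!.
    """
    w = np.exp(-0.5 * ((X - x) / h) ** 2)           # Gaussian kernel weights
    Z = np.vander(X - x, N=p + 1, increasing=True)  # columns 1, (X-x), ..., (X-x)^p
    A = (Z * w[:, None]).T @ Z
    b = (Z * w[:, None]).T @ y
    coef = np.linalg.solve(A, b)
    return coef[0]

# Just as LL reproduces lines exactly, a local quadratic reproduces quadratic data
rng = np.random.default_rng(4)
X = rng.uniform(0, 2, 300)
y = 1.0 + X + X**2
print(local_polynomial(0.7, X, y, h=0.3, p=2))  # = g(0.7) = 2.19 up to rounding
```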
Interestingly, the asymptotic behavior differs depending on whether $p$ is even or odd.

When $p$ is odd (e.g. LL), the bias is of order $O(h^{p+1})$ and is proportional to $g^{(p+1)}(x)$.

When $p$ is even (e.g. NW or local quadratic), the bias is of order $O(h^{p+2})$ but is proportional to $g^{(p+2)}(x)$ and $g^{(p+1)}(x) f^{(1)}(x)/f(x)$.

In either case, the variance is $O\left(\frac{1}{n|H|}\right)$.

What happens is that by increasing the polynomial order from even to the next odd number, the order of the bias does not change, but the bias simplifies. By increasing the polynomial order from odd to the next even number, the bias order decreases. This effect is analogous to the bias reduction achieved by higher-order kernels.

While local linear estimation is gaining popularity in econometric practice, local polynomial methods are not typically used. I believe this is mostly because typical econometric applications have $q > 1$, and it is difficult to apply polynomial methods in this context.
3.8 Weighted Nadaraya-Watson Estimator
In the context of conditional distribution estimation, Hall et al. (1999, JASA) and Cai (2002, ET) proposed a weighted NW estimator with the same asymptotic distribution as the LL estimator. This is discussed on pp. 187-188 of Li-Racine.

The estimator takes the form
$$\hat g(x) = \frac{\sum_{i=1}^n p_i(x) K\left(H^{-1}(X_i - x)\right) y_i}{\sum_{i=1}^n p_i(x) K\left(H^{-1}(X_i - x)\right)}$$
where the $p_i(x)$ are weights. The weights satisfy
$$p_i(x) \ge 0, \qquad \sum_{i=1}^n p_i(x) = 1, \qquad \sum_{i=1}^n p_i(x) K\left(H^{-1}(X_i - x)\right)\left(X_i - x\right) = 0.$$
The first two requirements set up the $p_i(x)$ as weights. The third equality requires the weights to force the kernel function to satisfy local linearity.

The weights are determined by empirical likelihood. Specifically, for each $x$, you maximize $\sum_{i=1}^n \ln p_i(x)$ subject to the above constraints. The solutions take the form
$$p_i(x) = \frac{1}{n}\left(1 + \lambda'\left(X_i - x\right) K\left(H^{-1}(X_i - x)\right)\right)^{-1}$$
where $\lambda$ is a Lagrange multiplier found by numerical optimization. For details about empirical likelihood, see my Econometrics lecture notes.

The above authors show that the estimator $\hat g(x)$ has the same asymptotic distribution as LL. When the dependent variable is non-negative, $y_i \ge 0$, the standard and weighted NW estimators also satisfy $\hat g(x) \ge 0$. This is an advantage since it is obvious in this case that $g(x) \ge 0$. In contrast, the LL estimator is not necessarily non-negative.

An important disadvantage of the weighted NW estimator is that it is considerably more computationally cumbersome than the LL estimator. The EL weights must be found separately for each $x$ at which $\hat g(x)$ is calculated.
3.9 Residual and Fit
Given any nonparametric estimator $\hat g(x)$ we can define the residual $\hat e_i = y_i - \hat g(X_i)$. Numerically, this requires computing the regression estimate at each observation. For example, in the case of NW estimation,
$$\hat e_i = y_i - \frac{\sum_{j=1}^n K\left(H^{-1}(X_j - X_i)\right) y_j}{\sum_{j=1}^n K\left(H^{-1}(X_j - X_i)\right)}.$$
From $\hat e_i$ we can compute many conventional regression statistics. For example, the residual variance estimate is $n^{-1} \sum_{i=1}^n \hat e_i^2$, and $R^2$ has the standard formula.

One cautionary remark: since the convergence rate for $\hat g$ is slower than $n^{-1/2}$, the same is true for many statistics computed from $\hat e_i$.

We can also compute the leave-one-out residuals
$$\hat e_{i,-i} = y_i - \hat g_{-i}(X_i) = y_i - \frac{\sum_{j \ne i} K\left(H^{-1}(X_j - X_i)\right) y_j}{\sum_{j \ne i} K\left(H^{-1}(X_j - X_i)\right)}.$$
3.10 Cross-Validation
For NW, LL and local polynomial regression, it is critical to have a reliable data-dependent rule for bandwidth selection. One popular and practical approach is cross-validation. The motivation starts by considering the sum of squared errors $\sum_{i=1}^n \hat e_i^2$. One could think about picking $h$ to minimize this quantity. But this is analogous to picking the number of regressors in least-squares by minimizing the sum of squared errors. In that context the solution is to pick all possible regressors, as the sum of squared errors is monotonically decreasing in the number of regressors. The same is true in nonparametric regression. As the bandwidth $h$ decreases, the in-sample fit of the model improves and $\sum_{i=1}^n \hat e_i^2$ decreases. As $h$ shrinks to zero, $\hat g(X_i)$ collapses on $y_i$ to obtain a perfect fit, $\hat e_i$ shrinks to zero and so does $\sum_{i=1}^n \hat e_i^2$. It is clearly a poor choice to pick $h$ based on this criterion.

Instead, we can consider the sum of squared leave-one-out residuals $\sum_{i=1}^n \hat e_{i,-i}^2$. This is a reasonable criterion. Because the quality of $\hat g(X_i)$ can be quite poor for tail values of $X_i$, it may be more sensible to use a trimmed version of the sum of squared residuals, and this is called the cross-validation criterion
$$CV(h) = \frac{1}{n} \sum_{i=1}^n \hat e_{i,-i}^2 M(X_i).$$
(We have also divided by the sample size for convenience.) The function $M(x)$ is a trimming function, the same as introduced in the definition of WIMSE earlier.

The cross-validation bandwidth $h$ is that which minimizes $CV(h)$. As in the case of density estimation, this needs to be done numerically.
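A sketch of cross-validated bandwidth choice for the NW estimator (my own code; Gaussian kernel, no trimming, i.e. $M(x) \equiv 1$, which is harmless here since the regressors are uniform), evaluating $CV(h)$ on a grid:

```python
import numpy as np

def cv_criterion(h, X, y):
    """CV(h) = (1/n) sum_i (y_i - g_{-i}(X_i))^2 with leave-one-out NW fits."""
    U = (X[:, None] - X[None, :]) / h   # U[i, j] = (X_i - X_j)/h
    K = np.exp(-0.5 * U**2)             # Gaussian kernel weights
    np.fill_diagonal(K, 0.0)            # leave own observation out
    g_loo = K @ y / K.sum(axis=1)       # g_{-i}(X_i) for each i
    return np.mean((y - g_loo) ** 2)

rng = np.random.default_rng(5)
n = 400
X = rng.uniform(-2, 2, n)
y = np.sin(2 * X) + 0.3 * rng.standard_normal(n)

grid = np.linspace(0.05, 1.0, 20)
cv = np.array([cv_criterion(h, X, y) for h in grid])
h_cv = grid[np.argmin(cv)]
print(h_cv)   # the CV bandwidth: the minimizer of CV(h) on the grid
```

In practice the grid search would be replaced by a numerical minimizer, and for heavy-tailed regressors the trimming function $M$ should be reinstated.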
To see that the CV criterion is sensible, let us calculate its expectation. Since $y_i = g(X_i) + e_i$,
$$E\left(CV(h)\right) = E\left(\left(e_i + g(X_i) - \hat g_{-i}(X_i)\right)^2 M(X_i)\right).$$