3 Nonparametric Regression
3.1 Nadaraya-Watson Regression
Let the data be $(y_i, X_i)$ where $y_i$ is real-valued and $X_i$ is a $q$-vector, and assume that all are continuously distributed with joint density $f(y,x)$. Let $f(y \mid x) = f(y,x)/f(x)$ be the conditional density of $y_i$ given $X_i$, where $f(x) = \int f(y,x)\,dy$ is the marginal density of $X_i$. The regression function for $y_i$ on $X_i$ is
$$g(x) = E\left(y_i \mid X_i = x\right).$$
We want to estimate this nonparametrically, with minimal assumptions about $g$.

If we had a large number of observations where $X_i$ exactly equals $x$, we could take the average value of the $y_i$'s for these observations. But since $X_i$ is continuously distributed, we will not observe multiple observations taking the same value.

The solution is to consider a neighborhood of $x$, and note that if $X_i$ has a positive density at $x$, we should observe a number of observations in this neighborhood, and this number increases with the sample size. If the regression function $g(x)$ is continuous, it should be reasonably constant over this neighborhood (if it is small enough), so we can take the average of the $y_i$ values for these observations. The trick is to choose the size of the neighborhood to trade off the variation in $g(x)$ over the neighborhood (estimation bias) against the number of observations in the neighborhood (estimation variance).
Take the one-regressor case $q = 1$.

Let a neighborhood of $x$ be $x \pm h$ for some bandwidth $h > 0$. Then a simple nonparametric estimator of $g(x)$ is the average value of the $y_i$'s for the observations $i$ such that $X_i$ is in this neighborhood, that is,
$$\hat g(x) = \frac{\sum_{i=1}^n 1\left(|X_i - x| \le h\right) y_i}{\sum_{i=1}^n 1\left(|X_i - x| \le h\right)} = \frac{\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) y_i}{\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)}$$
where $k(u)$ is the uniform kernel.
In general, the kernel regression estimator takes this form, where $k(u)$ is a kernel function. It is known as the Nadaraya-Watson estimator, or local constant estimator.
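To make the estimator concrete, here is a minimal sketch of the univariate Nadaraya-Watson estimator in Python (NumPy). The Gaussian kernel, the function names, and the test data are my own illustrative choices, not from the notes.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, a common second-order kernel k(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nw_estimator(x, X, y, h, kernel=gaussian_kernel):
    """Nadaraya-Watson (local constant) estimate of g(x) = E(y | X = x).

    Computes sum_i k((X_i - x)/h) y_i / sum_i k((X_i - x)/h).
    """
    w = kernel((X - x) / h)
    return np.sum(w * y) / np.sum(w)

# Illustrative check: g(x) = x^2 with uniform regressors and small noise
rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(-1, 1, n)
y = X**2 + 0.1 * rng.standard_normal(n)
print(nw_estimator(0.0, X, y, h=0.1))  # close to g(0) = 0
```

With a uniform marginal density the estimator is close to the truth; Section 3.5 below shows this can fail badly when the regressors are unevenly spaced.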
When $q > 1$ the estimator is
$$\hat g(x) = \frac{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) y_i}{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)}$$
where $K(u)$ is a multivariate kernel function.
As an alternative motivation, note that the regression function can be written as
$$g(x) = \frac{\int y f(y,x)\,dy}{f(x)}$$
where $f(x) = \int f(y,x)\,dy$ is the marginal density of $X_i$. Now consider estimating $g$ by replacing the density functions with the nonparametric estimates we have already studied. That is,
$$\hat f(y,x) = \frac{1}{n|H| h_y} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) k\left(\frac{y_i - y}{h_y}\right)$$
where $h_y$ is a bandwidth for smoothing in the $y$-direction. Then
$$\hat f(x) = \int \hat f(y,x)\,dy = \frac{1}{n|H| h_y} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) \int k\left(\frac{y_i - y}{h_y}\right) dy = \frac{1}{n|H|} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)$$
and
$$\int y \hat f(y,x)\,dy = \frac{1}{n|H| h_y} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) \int y\, k\left(\frac{y_i - y}{h_y}\right) dy = \frac{1}{n|H|} \sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) y_i$$
and thus taking the ratio
$$\hat g(x) = \frac{\frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) y_i}{\frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)} = \frac{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) y_i}{\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)},$$
again obtaining the Nadaraya-Watson estimator. Note that the bandwidth $h_y$ has disappeared.
The estimator is ill-defined for values of $x$ such that $\hat f(x) \le 0$. This can occur in the tails of the distribution of $X_i$. As higher-order kernels can yield $\hat f(x) < 0$, many authors suggest using only second-order kernels for regression. I am unsure if this is a correct recommendation. If a higher-order kernel is used and for some $x$ we find $\hat f(x) < 0$, this suggests that the data are so sparse in that neighborhood of $x$ that it is unreasonable to estimate the regression function there. It does not require the abandonment of higher-order kernels. We will follow convention and typically assume that $k$ is second order ($\nu = 2$) for our presentation.
3.2 Asymptotic Distribution
We analyze the asymptotic distribution of the NW estimator $\hat g(x)$ for the case $q = 1$.

Since $E(y_i \mid X_i) = g(X_i)$, we can write the regression equation as $y_i = g(X_i) + e_i$ where $E(e_i \mid X_i) = 0$. We can also write the conditional variance as $E\left(e_i^2 \mid X_i = x\right) = \sigma^2(x)$.
Fix $x$. Note that
$$y_i = g(X_i) + e_i = g(x) + \left(g(X_i) - g(x)\right) + e_i$$
and therefore
$$\frac{1}{nh} \sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) y_i = \frac{1}{nh} \sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) g(x) + \frac{1}{nh} \sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) \left(g(X_i) - g(x)\right) + \frac{1}{nh} \sum_{i=1}^n k\left(\frac{X_i - x}{h}\right) e_i = \hat f(x) g(x) + \hat m_1(x) + \hat m_2(x),$$
say. It follows that
$$\hat g(x) = g(x) + \frac{\hat m_1(x)}{\hat f(x)} + \frac{\hat m_2(x)}{\hat f(x)}.$$
We now analyze the asymptotic distributions of the components $\hat m_1(x)$ and $\hat m_2(x)$.

First take $\hat m_2(x)$. Since $E(e_i \mid X_i) = 0$ it follows that $E\left(k\left(\frac{X_i - x}{h}\right) e_i\right) = 0$ and thus $E(\hat m_2(x)) = 0$. Its variance is
$$\mathrm{var}\left(\hat m_2(x)\right) = \frac{1}{nh^2} E\left(k\left(\frac{X_i - x}{h}\right) e_i\right)^2 = \frac{1}{nh^2} E\left(k\left(\frac{X_i - x}{h}\right)^2 \sigma^2(X_i)\right)$$
(by conditioning), and this is
$$\frac{1}{nh^2} \int k\left(\frac{z - x}{h}\right)^2 \sigma^2(z) f(z)\,dz$$
(where $f(z)$ is the density of $X_i$). Making the change of variables $z = x + hu$, this equals
$$\frac{1}{nh} \int k(u)^2 \sigma^2(x + hu) f(x + hu)\,du = \frac{1}{nh} \int k(u)^2 \sigma^2(x) f(x)\,du + o\left(\frac{1}{nh}\right) = \frac{R(k)\sigma^2(x) f(x)}{nh} + o\left(\frac{1}{nh}\right)$$
if $\sigma^2(x)$ and $f(x)$ are smooth in $x$. We can even apply the CLT to obtain that as $h \to 0$ and $nh \to \infty$,
$$\sqrt{nh}\,\hat m_2(x) \to_d N\left(0,\; R(k)\sigma^2(x) f(x)\right).$$
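This variance result can be checked by simulation. The sketch below is my own illustration, under the assumptions: uniform kernel $k(u) = \frac12 1(|u| \le 1)$ with $R(k) = 1/2$, $X_i \sim N(0,1)$, homoskedastic errors with $\sigma^2(x) = 1$, and $x = 0$. It compares the Monte Carlo variance of $\sqrt{nh}\,\hat m_2(0)$ with the asymptotic value $R(k)\sigma^2(0)f(0)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def m2_hat(X, e, x, h):
    """m2(x) = (1/(nh)) sum_i k((X_i - x)/h) e_i with the uniform kernel."""
    k = 0.5 * (np.abs((X - x) / h) <= 1)
    return np.mean(k * e) / h

n, h, reps = 2000, 0.2, 3000
draws = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal(n)
    e = rng.standard_normal(n)          # sigma^2(x) = 1
    draws[r] = np.sqrt(n * h) * m2_hat(X, e, 0.0, h)

# Asymptotic variance: R(k) * sigma^2(0) * f(0), with R(k) = 1/2, f(0) = 1/sqrt(2 pi)
asym = 0.5 * 1.0 / np.sqrt(2 * np.pi)
print(draws.var(), asym)   # the two should be close
```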
Now take $\hat m_1(x)$. Its mean is
$$E \hat m_1(x) = \frac{1}{h} E\, k\left(\frac{X_i - x}{h}\right)\left(g(X_i) - g(x)\right) = \frac{1}{h} \int k\left(\frac{z - x}{h}\right)\left(g(z) - g(x)\right) f(z)\,dz = \int k(u)\left(g(x + hu) - g(x)\right) f(x + hu)\,du.$$
Now expanding both $g$ and $f$ in Taylor expansions, this equals, up to $o(h^2)$,
$$\int k(u)\left(uh\, g^{(1)}(x) + \frac{u^2 h^2}{2} g^{(2)}(x)\right)\left(f(x) + uh\, f^{(1)}(x)\right) du = \left(\int k(u)\, u\,du\right) h\, g^{(1)}(x) f(x) + \left(\int k(u)\, u^2\,du\right) h^2 \left(\frac{1}{2} g^{(2)}(x) f(x) + g^{(1)}(x) f^{(1)}(x)\right) = h^2 \mu_2 B(x) f(x),$$
where $\mu_2 = \int k(u)\, u^2\,du$, the first term vanishes since $\int k(u)\, u\,du = 0$ for a second-order kernel, and
$$B(x) = \frac{1}{2} g^{(2)}(x) + f(x)^{-1} g^{(1)}(x) f^{(1)}(x).$$
(If $k$ is a higher-order kernel, this is $O(h^\nu)$ instead.) A similar expansion shows that $\mathrm{var}(\hat m_1(x)) = O\left(\frac{h^2}{nh}\right)$, which is of smaller order than $O\left(\frac{1}{nh}\right)$. Thus
$$\sqrt{nh}\left(\hat m_1(x) - h^2 \mu_2 B(x) f(x)\right) \to_p 0$$
and since $\hat f(x) \to_p f(x)$,
$$\sqrt{nh}\left(\frac{\hat m_1(x)}{\hat f(x)} - h^2 \mu_2 B(x)\right) \to_p 0.$$
In summary, we have
$$\sqrt{nh}\left(\hat g(x) - g(x) - h^2 \mu_2 B(x)\right) = \sqrt{nh}\left(\frac{\hat m_1(x)}{\hat f(x)} - h^2 \mu_2 B(x)\right) + \frac{\sqrt{nh}\,\hat m_2(x)}{\hat f(x)} \to_d \frac{N\left(0,\; R(k)\sigma^2(x) f(x)\right)}{f(x)} = N\left(0,\; \frac{R(k)\sigma^2(x)}{f(x)}\right).$$
When $X_i$ is a $q$-vector, the result is
$$\sqrt{n|H|}\left(\hat g(x) - g(x) - \mu_2 \sum_{j=1}^q h_j^2 B_j(x)\right) \to_d N\left(0,\; \frac{R(k)^q \sigma^2(x)}{f(x)}\right)$$
where
$$B_j(x) = \frac{1}{2} \frac{\partial^2}{\partial x_j^2} g(x) + f(x)^{-1} \frac{\partial}{\partial x_j} g(x) \frac{\partial}{\partial x_j} f(x).$$
3.3 Mean Squared Error
The AMSE of the NW estimator $\hat g(x)$ is
$$\mathrm{AMSE}\left(\hat g(x)\right) = \mu_2^2 \left(\sum_{j=1}^q h_j^2 B_j(x)\right)^2 + \frac{R(k)^q \sigma^2(x)}{n|H| f(x)}.$$
A weighted integrated MSE takes the form
$$\mathrm{WIMSE} = \int \mathrm{AMSE}\left(\hat g(x)\right) f(x) M(x)\,dx = \mu_2^2 \int \left(\sum_{j=1}^q h_j^2 B_j(x)\right)^2 f(x) M(x)\,dx + \frac{R(k)^q \int \sigma^2(x) M(x)\,dx}{n h_1 h_2 \cdots h_q}$$
where $M(x)$ is a weight function. Possible choices include $M(x) = f(x)$ and $M(x) = 1\left(f(x) \ge \varepsilon\right)$ for some $\varepsilon > 0$. The AMSE needs the weighting; otherwise the integral will not exist.
3.4 Observations about the Asymptotic Distribution
In univariate regression, the optimal rate for the bandwidth is $h_0 = C n^{-1/5}$, with mean-squared convergence $O(n^{-2/5})$. In the multiple regressor case, the optimal bandwidths are $h_j = C n^{-1/(q+4)}$ with convergence rate $O\left(n^{-2/(q+4)}\right)$. This is the same as for univariate and $q$-variate density estimation.

If higher-order kernels are used, the optimal bandwidth and convergence rates are again the same as for density estimation.

The asymptotic distribution depends on the kernel through $R(k)$ and $\mu_2$. The optimal kernel minimizes $R(k)$, the same as for density estimation. Thus the Epanechnikov family is optimal for regression.

As the WIMSE depends on the first and second derivatives of the mean function $g(x)$, the optimal bandwidth will depend on these values. When the derivative functions $B_j(x)$ are larger, the optimal bandwidths are smaller, to capture the fluctuations in the function $g(x)$. When the derivatives are smaller, the optimal bandwidths are larger, smoothing more and thus reducing the estimation variance.

For nonparametric regression, reference bandwidths are not natural. This is because there is no natural reference $g(x)$ which dictates the first and second derivatives. Many authors use the rule-of-thumb bandwidth for density estimation (for the regressors $X_i$), but there is absolutely no justification for this choice. The theory shows that the optimal bandwidth depends on the curvature of the conditional mean $g(x)$, and this is independent of the marginal density $f(x)$ for which the rule-of-thumb is designed.
3.5 Limitations of the NW estimator
Suppose that $q = 1$ and the true conditional mean is linear: $g(x) = \alpha + x\beta$. As this is a very simple situation, we might expect that a nonparametric estimator will work reasonably well. This is not necessarily the case with the NW estimator.

Take the absolutely simplest case, that there is no regression error, i.e. $y_i = \alpha + X_i \beta$ identically. A simple scatter plot would reveal the deterministic relationship. How will NW perform?

The answer depends on the marginal distribution of the $X_i$. If they are not spaced at uniform distances, then $\hat g(x) \ne g(x)$. The NW estimator applied to purely linear data yields a nonlinear output!

One way to see the source of the problem is to consider the problem of nonparametrically estimating $E\left(X_i - x \mid X_i = x\right) = 0$. The numerator of the NW estimator of this expectation is
$$\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)\left(X_i - x\right)$$
but this is (generally) non-zero.
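A quick numeric illustration of this point (my own example, not from the notes): generate deterministic linear data $y_i = 1 + 2X_i$ with exponentially distributed regressors, so the $X_i$ are unevenly spaced with density decreasing in $x$, and evaluate the NW estimator at $x = 1$. Both the sum above and the NW bias come out visibly non-zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h = 10000, 0.5
X = rng.exponential(1.0, n)       # non-uniform spacing: density decreasing in x
y = 1.0 + 2.0 * X                 # exactly linear, no error

def gauss(u):
    return np.exp(-0.5 * u**2)

x = 1.0
k = gauss((X - x) / h)
numerator = np.sum(k * (X - x))   # NW numerator for E(X_i - x | X_i = x); should be 0, but isn't
g_hat = np.sum(k * y) / np.sum(k)

print(numerator)                  # negative: weights pull toward the denser region (small x)
print(g_hat, 1.0 + 2.0 * x)       # NW estimate is well below the true value g(1) = 3
```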
Can the problem be resolved by choice of bandwidth? Actually, it can make things worse. As the bandwidth increases (to increase smoothing), $\hat g(x)$ collapses to a flat function. Recall that the NW estimator is also called the local constant estimator: it approximates the regression function by a (local) constant. As smoothing increases, the estimator simplifies to a constant, not to a linear function.

Another limitation of the NW estimator occurs at the edges of the support. Again consider the case $q = 1$. For a value of $x \le \min(X_i)$, the NW estimator $\hat g(x)$ is an average only of $y_i$ values for observations to the right of $x$. If $g(x)$ is positively sloped, the NW estimator will be upward biased. In fact, the estimator is inconsistent at the boundary. This effectively restricts application of the NW estimator to values of $x$ in the interior of the support of the regressors, and this may be too limiting.
3.6 Local Linear Estimator
We started this chapter by motivating the NW estimator at $x$ by taking an average of the $y_i$ values for observations such that $X_i$ is in a neighborhood of $x$. This is a local constant approximation. Instead, we could fit a linear regression line through the observations in the same neighborhood. If we use a weighting function, this is called the local linear (LL) estimator, and it is quite popular in the recent nonparametric regression literature.

The idea is to fit the local model
$$y_i = \alpha + \beta'\left(X_i - x\right) + e_i.$$
The reason for using the regressor $X_i - x$ rather than $X_i$ is so that the intercept equals $g(x) = E\left(y_i \mid X_i = x\right)$. Once we obtain the estimates $\hat\alpha(x)$ and $\hat\beta(x)$, we set $\hat g(x) = \hat\alpha(x)$. Furthermore, we can use $\hat\beta(x)$ to estimate $\frac{\partial}{\partial x} g(x)$.
If we simply fit a linear regression through observations such that $|X_i - x| \le h$, this can be written as
$$\min_{\alpha, \beta} \sum_{i=1}^n \left(y_i - \alpha - \beta'\left(X_i - x\right)\right)^2 1\left(|X_i - x| \le h\right)$$
or, setting
$$Z_i = \begin{pmatrix} 1 \\ X_i - x \end{pmatrix},$$
we have the explicit expression
$$\begin{pmatrix} \hat\alpha(x) \\ \hat\beta(x) \end{pmatrix} = \left(\sum_{i=1}^n 1\left(|X_i - x| \le h\right) Z_i Z_i'\right)^{-1} \left(\sum_{i=1}^n 1\left(|X_i - x| \le h\right) Z_i y_i\right) = \left(\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) Z_i Z_i'\right)^{-1} \left(\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right) Z_i y_i\right)$$
where the second expression is valid for any (multivariate) kernel function. This is a (locally) weighted regression of $y_i$ on $X_i$. Algebraically, it equals a WLS estimator.
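The weighted least-squares formula above translates directly into code. This is a minimal sketch for a univariate regressor (my own code; Gaussian kernel assumed), returning both $\hat g(x) = \hat\alpha(x)$ and the local slope $\hat\beta(x)$.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def local_linear(x, X, y, h, kernel=gaussian_kernel):
    """Local linear estimate at x: weighted LS of y on Z_i = (1, X_i - x)'.

    Returns (alpha_hat, beta_hat) = (sum_i K_i Z_i Z_i')^{-1} (sum_i K_i Z_i y_i).
    """
    w = kernel((X - x) / h)
    Z = np.column_stack([np.ones_like(X), X - x])
    A = (Z * w[:, None]).T @ Z          # sum_i K_i Z_i Z_i'
    b = (Z * w[:, None]).T @ y          # sum_i K_i Z_i y_i
    alpha_hat, beta_hat = np.linalg.solve(A, b)
    return alpha_hat, beta_hat

# LL preserves linear data exactly: y = 2 + 3X is reproduced at any x,
# even with the unevenly spaced regressors that break the NW estimator
rng = np.random.default_rng(3)
X = rng.exponential(1.0, 500)
y = 2.0 + 3.0 * X
g_hat, slope = local_linear(1.0, X, y, h=0.5)
print(g_hat, slope)   # 5.0 and 3.0 up to rounding error
```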
In contrast to the NW estimator, the LL estimator preserves linear data. That is, if the true data lie on a line $y_i = \alpha + X_i'\beta$, then for any sub-sample, a local linear regression fits exactly, so $\hat g(x) = g(x)$. In fact, we will see that the distribution of the LL estimator is invariant to the first derivative of $g$: it has zero bias when the true regression is linear.

As $h \to \infty$ (smoothing is increased), the LL estimator collapses to the OLS regression of $y_i$ on $X_i$. In this sense LL is a natural nonparametric generalization of least-squares regression.

The LL estimator also has much better properties at the boundary than the NW estimator. Intuitively, even if $x$ is at the boundary of the regression support, the local linear estimator fits a (weighted) least-squares line through data near the boundary, so if the true relationship is linear this estimator will be unbiased.
Deriving the asymptotic distribution of the LL estimator is similar to that for the NW estimator, but much more involved, so I will not present the argument here. It has the following asymptotic distribution. Let $\hat g(x) = \hat\alpha(x)$. Then
$$\sqrt{n|H|}\left(\hat g(x) - g(x) - \mu_2 \sum_{j=1}^q h_j^2 \frac{1}{2} \frac{\partial^2}{\partial x_j^2} g(x)\right) \to_d N\left(0,\; \frac{R(k)^q \sigma^2(x)}{f(x)}\right).$$
This is quite similar to the distribution for the NW estimator, with one important difference: the bias term has been simplified. The term involving $f(x)^{-1} \frac{\partial}{\partial x_j} g(x) \frac{\partial}{\partial x_j} f(x)$ has been eliminated. The asymptotic variance is unchanged.
Strictly speaking, we cannot rank the AMSE of the NW versus the LL estimator. While a bias term has been eliminated, it is possible that the two terms have opposite signs and thereby cancel somewhat. However, the standard intuition is that a simplified bias term suggests reduced bias in practice. The AMSE of the LL estimator depends only on the second derivative of $g(x)$, while that of the NW estimator also depends on the first derivative. We expect this to translate into reduced bias.

Magically, this does not come at a cost in the asymptotic variance. These facts have led the statistics literature to focus on the LL estimator as the preferred approach.

While I agree with this general view, a side note of caution is warranted. Simple simulation experiments show that the LL estimator does not always beat the NW estimator. When the regression function $g(x)$ is quite flat, the NW estimator does better. When the regression function is steeper and curvier, the LL estimator tends to do better. The explanation is that while the two have identical asymptotic variance formulae, in finite samples the NW estimator tends to have a smaller variance. This gives it an advantage in contexts where estimation bias is low (such as when the regression function is flat). The reason I mention this is that in many economic contexts, it is believed that the regression function may be quite flat with respect to many regressors. In this context it may be better to use NW rather than LL.
3.7 Local Polynomial Estimation
If LL improves on NW, why not local polynomial? The intuition is quite straightforward. Rather than fitting a local linear equation, we can fit a local quadratic, cubic, or polynomial of arbitrary order.

Let $p$ denote the order of the local polynomial. Thus $p = 0$ is the NW estimator, $p = 1$ is the LL estimator, and $p = 2$ is a local quadratic.
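The local polynomial estimator is the same weighted least-squares idea with regressors $(1, (X_i - x), \ldots, (X_i - x)^p)$. A sketch for a univariate regressor (my own code; Gaussian weights assumed):

```python
import numpy as np

def local_polynomial(x, X, y, h, p):
    """Local polynomial estimate of g(x): weighted LS of y on powers of (X_i - x).

    p = 0 gives NW, p = 1 local linear, p = 2 local quadratic.
    The intercept estimates g(x); the j-th coefficient estimates g^(j)(x)/j!.
    """
    w = np.exp(-0.5 * ((X - x) / h) ** 2)           # Gaussian kernel weights
    Z = np.vander(X - x, N=p + 1, increasing=True)  # columns 1, (X-x), ..., (X-x)^p
    A = (Z * w[:, None]).T @ Z
    b = (Z * w[:, None]).T @ y
    coef = np.linalg.solve(A, b)
    return coef[0]

# Just as LL reproduces lines exactly, a local quadratic reproduces quadratic data
rng = np.random.default_rng(4)
X = rng.uniform(0, 2, 300)
y = 1.0 + X + X**2
print(local_polynomial(0.7, X, y, h=0.3, p=2))  # = g(0.7) = 2.19 up to rounding
```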
Interestingly, the asymptotic behavior differs depending on whether $p$ is even or odd.

When $p$ is odd (e.g. LL), the bias is of order $O(h^{p+1})$ and is proportional to $g^{(p+1)}(x)$.

When $p$ is even (e.g. NW or local quadratic), the bias is of order $O(h^{p+2})$ but is proportional to $g^{(p+2)}(x)$ and $g^{(p+1)}(x) f^{(1)}(x)/f(x)$.

In either case, the variance is $O\left(\frac{1}{n|H|}\right)$.

What happens is that by increasing the polynomial order from even to the next odd number, the order of the bias does not change, but the bias simplifies. By increasing the polynomial order from odd to the next even number, the bias order decreases. This effect is analogous to the bias reduction achieved by higher-order kernels.

While local linear estimation is gaining popularity in econometric practice, local polynomial methods are not typically used. I believe this is mostly because typical econometric applications have $q > 1$, and it is difficult to apply polynomial methods in this context.
3.8 Weighted Nadaraya-Watson Estimator
In the context of conditional distribution estimation, Hall et al. (1999, JASA) and Cai (2002, ET) proposed a weighted NW estimator with the same asymptotic distribution as the LL estimator. This is discussed on pp. 187-188 of Li-Racine.

The estimator takes the form
$$\hat g(x) = \frac{\sum_{i=1}^n p_i(x) K\left(H^{-1}(X_i - x)\right) y_i}{\sum_{i=1}^n p_i(x) K\left(H^{-1}(X_i - x)\right)}$$
where the $p_i(x)$ are weights. The weights satisfy
$$p_i(x) \ge 0, \qquad \sum_{i=1}^n p_i(x) = 1, \qquad \sum_{i=1}^n p_i(x) K\left(H^{-1}(X_i - x)\right)\left(X_i - x\right) = 0.$$
The first two requirements set up the $p_i(x)$ as weights. The third equality requires the weights to force the kernel function to satisfy local linearity.

The weights are determined by empirical likelihood. Specifically, for each $x$, you maximize $\sum_{i=1}^n \ln p_i(x)$ subject to the above constraints. The solutions take the form
$$p_i(x) = \frac{1}{n}\left(1 + \lambda'\left(X_i - x\right) K\left(H^{-1}(X_i - x)\right)\right)^{-1}$$
where $\lambda$ is a Lagrange multiplier found by numerical optimization. For details about empirical likelihood, see my Econometrics lecture notes.

The above authors show that the estimator $\hat g(x)$ has the same asymptotic distribution as LL. When the dependent variable is non-negative, $y_i \ge 0$, the standard and weighted NW estimators also satisfy $\hat g(x) \ge 0$. This is an advantage since it is obvious in this case that $g(x) \ge 0$. In contrast, the LL estimator is not necessarily non-negative.

An important disadvantage of the weighted NW estimator is that it is considerably more computationally cumbersome than the LL estimator. The EL weights must be found separately for each $x$ at which $\hat g(x)$ is calculated.
3.9 Residual and Fit
Given any nonparametric estimator $\hat g(x)$ we can define the residual $\hat e_i = y_i - \hat g(X_i)$. Numerically, this requires computing the regression estimate at each observation. For example, in the case of NW estimation,
$$\hat e_i = y_i - \frac{\sum_{j=1}^n K\left(H^{-1}(X_j - X_i)\right) y_j}{\sum_{j=1}^n K\left(H^{-1}(X_j - X_i)\right)}.$$
From $\hat e_i$ we can compute many conventional regression statistics. For example, the residual variance estimate is $n^{-1} \sum_{i=1}^n \hat e_i^2$, and $R^2$ has the standard formula.

One cautionary remark: since the convergence rate for $\hat g$ is slower than $n^{-1/2}$, the same is true for many statistics computed from $\hat e_i$.

We can also compute the leave-one-out residuals
$$\hat e_{i,-i} = y_i - \hat g_{-i}(X_i) = y_i - \frac{\sum_{j \ne i} K\left(H^{-1}(X_j - X_i)\right) y_j}{\sum_{j \ne i} K\left(H^{-1}(X_j - X_i)\right)}.$$
3.10 Cross-Validation
For NW, LL and local polynomial regression, it is critical to have a reliable data-dependent rule for bandwidth selection. One popular and practical approach is cross-validation. The motivation starts by considering the sum of squared errors $\sum_{i=1}^n \hat e_i^2$. One could think about picking $h$ to minimize this quantity. But this is analogous to picking the number of regressors in least-squares by minimizing the sum of squared errors. In that context the solution is to pick all possible regressors, as the sum of squared errors is monotonically decreasing in the number of regressors. The same is true in nonparametric regression. As the bandwidth $h$ decreases, the in-sample fit of the model improves and $\sum_{i=1}^n \hat e_i^2$ decreases. As $h$ shrinks to zero, $\hat g(X_i)$ collapses on $y_i$ to obtain a perfect fit, $\hat e_i$ shrinks to zero and so does $\sum_{i=1}^n \hat e_i^2$. It is clearly a poor choice to pick $h$ based on this criterion.

Instead, we can consider the sum of squared leave-one-out residuals $\sum_{i=1}^n \hat e_{i,-i}^2$. This is a reasonable criterion. Because the quality of $\hat g(X_i)$ can be quite poor for tail values of $X_i$, it may be more sensible to use a trimmed version of the sum of squared residuals, and this is called the cross-validation criterion
$$CV(h) = \frac{1}{n} \sum_{i=1}^n \hat e_{i,-i}^2 M(X_i).$$
(We have also divided by the sample size for convenience.) The function $M(x)$ is a trimming function, the same as introduced in the definition of WIMSE earlier.

The cross-validation bandwidth $h$ is that which minimizes $CV(h)$. As in the case of density estimation, this needs to be done numerically.
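A sketch of cross-validated bandwidth choice for the NW estimator (my own code; Gaussian kernel, no trimming, i.e. $M(x) \equiv 1$, which is harmless here since the regressors are uniform), evaluating $CV(h)$ on a grid:

```python
import numpy as np

def cv_criterion(h, X, y):
    """CV(h) = (1/n) sum_i (y_i - g_{-i}(X_i))^2 with leave-one-out NW fits."""
    U = (X[:, None] - X[None, :]) / h   # U[i, j] = (X_i - X_j)/h
    K = np.exp(-0.5 * U**2)             # Gaussian kernel weights
    np.fill_diagonal(K, 0.0)            # leave own observation out
    g_loo = K @ y / K.sum(axis=1)       # g_{-i}(X_i) for each i
    return np.mean((y - g_loo) ** 2)

rng = np.random.default_rng(5)
n = 400
X = rng.uniform(-2, 2, n)
y = np.sin(2 * X) + 0.3 * rng.standard_normal(n)

grid = np.linspace(0.05, 1.0, 20)
cv = np.array([cv_criterion(h, X, y) for h in grid])
h_cv = grid[np.argmin(cv)]
print(h_cv)   # the CV bandwidth: the minimizer of CV(h) on the grid
```

In practice the grid search would be replaced by a numerical minimizer, and for heavy-tailed regressors the trimming function $M$ should be reinstated.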
To see that the CV criterion is sensible, let us calculate its expectation. Since $y_i = g(X_i) + e_i$,
$$E\left(CV(h)\right) = E\left(\left(e_i + g(X_i) - \hat g_{-i}(X_i)\right)^2 M(X_i)\right).$$