Nonparametric and semiparametric methods in R


NONPARAMETRIC AND SEMIPARAMETRIC METHODS IN R

JEFFREY S. RACINE

Abstract. The R environment for statistical computing and graphics (R Development Core Team (2008)) offers practitioners a rich set of statistical methods ranging from random number generation and optimization methods through regression, panel data, and time series methods, by way of illustration. The standard R distribution (‘base R’) comes preloaded with a rich variety of functionality useful for applied econometricians. This functionality is enhanced by user supplied packages made available via R servers that are mirrored around the world. Of interest in this chapter are methods for estimating nonparametric and semiparametric models. We summarize many of the facilities in R and consider some tools that might be of interest to those wishing to work with nonparametric methods who want to avoid resorting to programming in C or Fortran but need the speed of compiled code as opposed to interpreted code such as Gauss or Matlab by way of example. We encourage those working in the field to strongly consider implementing their methods in the R environment thereby making their work accessible to the widest possible audience via an open collaborative forum.

1. Introduction

Unlike their more established parametric counterparts, many nonparametric and semiparametric methods that have received widespread theoretical treatment have not yet found their way into mainstream commercial packages. This has hindered their adoption by applied researchers, and it is safe to describe the availability of modern nonparametric methods as fragmented at best, which can be frustrating for users who wish to assess whether or not such methods can add value to their application.
Thus, one frequently heard complaint about the state of nonparametric kernel methods concerns the lack of software along with the fact that implementations in interpreted environments such as Gauss are orders of magnitude slower than compiled implementations written in C or Fortran. Though many researchers may code their methods, often using interpreted environments such as Gauss, it is fair to characterize much of this code as neither designed nor suited for general purpose use, as it is typically written solely to demonstrate ‘proof of concept’. Even though many authors are more than happy to circulate such code (which is of course appreciated!), this often imposes certain hardships on the user including 1) having to purchase a (closed and proprietary) commercial software package and 2) having to modify the code substantially in order to use it for their application.

Date: November 14, 2008.

The R environment for statistical computing and graphics (R Development Core Team (2008)) offers practitioners a range of tools for estimating nonparametric, semiparametric, and of course parametric models. Unlike many commercial programs, which must first be purchased in order to evaluate them, you can adopt R with minimal effort and with no financial outlay required. Many nonparametric methods are well documented, tested, and suitable for general use via a common interface structure (such as the ‘formula’ interface), making it easy for users familiar with R to deploy these tools for their particular application. Furthermore, one of the strengths of R is the ability to call compiled C or Fortran code via a common interface structure, thereby delivering the speed of compiled code in a flexible, easy to use environment.
In addition, there exist a number of R ‘packages’ (often called ‘libraries’ or ‘modules’ in other environments) that implement a variety of kernel methods, albeit with varying degrees of functionality (e.g., univariate versus multivariate, the ability/inability to handle numerical and categorical data and so forth). Finally, R delivers a rich framework for implementing and making code available to the community.

In this chapter we outline many of the functions and packages available in R that might be of interest to practitioners, and consider some illustrative applications along with code fragments that might be of interest. Before proceeding further, we first begin with an introduction to the R environment itself.

2. The R Environment

What is R? Perhaps it is best to begin with the question “what is S”? S is a language and environment designed for statistical computing and graphics which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies). S has grown to become the de-facto standard among econometricians and statisticians, and there are two main implementations, the commercial implementation called ‘S-PLUS’, and the free, open-source implementation called ‘R’. R delivers a rich array of statistical methods, and one of its strengths is the ease with which ‘packages’ can be developed and made available to users for free. R is a mature open platform that is ideally suited to the task of making one's method available to the widest possible user base free of charge.

In this section we briefly describe a handful of resources available to those interested in using R, introduce the user to the R environment, and introduce the user to the foreign package that facilitates importation of data from packages such as SAS, SPSS, Stata, and Minitab, among others.

2.1. Web sites. A number of sites are devoted to helping R users, and we briefly mention a few of them below.
http://www.R-project.org/: This is the R home page from which you can download the program itself and many R packages. There are also manuals, other links, and facilities for joining various R mailing lists.

http://CRAN.R-project.org/: This is the ‘Comprehensive R Archive Network,’ “a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for the R statistical package.” Packages are only put on CRAN when they pass a rather stringent collection of quality assurance checks, and in particular are guaranteed to build and run on standard platforms.

http://cran.r-project.org/web/views/Econometrics.html: This is the CRAN ‘task view’ for computational econometrics. “Base R ships with a lot of functionality useful for computational econometrics, in particular in the stats package. This functionality is complemented by many packages on CRAN, a brief overview is given below.” This provides an excellent summary of both parametric and nonparametric packages that exist for the R environment.

http://pj.freefaculty.org/R/Rtips.html: This site provides a large and excellent collection of R tips.

2.2. Getting started with R. A number of well written manuals exist for R and can be located at the R web site. This section is clearly not intended to be a substitute for these resources. It simply provides a minimal set of commands which will aid those who have never used R before.

Having installed and run R, you will find yourself at the > prompt. To quit the program, simply type q(). To get help, you can either enter a command preceded by a question mark, as in ?help, or type help.start() at the > prompt. The latter will spawn your web browser (it reads files from your hard drive, so you do not have to be connected to the Internet to use this feature).
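As a small, hedged illustration of the help facilities just described, the sketch below uses two functions that are not mentioned in the text above but ship with base R: help(), which is the function form of the ? shortcut, and apropos(), which lists objects whose names match a pattern. The topic "density" and the pattern are chosen purely for illustration.

```r
# help("topic") is the function form of ?topic; storing the result
# instead of printing it avoids opening a pager or browser window.
h <- help("density", package = "stats")
class(h)

# apropos() searches the names of objects in the attached packages.
# Here we list the bandwidth selectors whose names start with "bw.".
bw.functions <- apropos("^bw\\.")
print(bw.functions)
```

Typing h (or simply ?density) at the prompt then displays the help page itself.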
You can enter commands interactively at the R prompt, or you can create a text file containing the commands and execute all commands in the file from the R prompt by typing source("commands.R"), where commands.R is the text file containing your commands. Many editors recognize the .R extension, providing a useful interface for the development of R code. For example, GNU Emacs is a powerful editor that works well with R and also LaTeX (http://www.gnu.org/software/emacs/emacs.html).

When you quit by entering the q() command, you will be asked whether or not you wish to save the current session. If you enter Y, then the next time you run R in the same directory it will load all of the objects created in the previous session. If you do so, typing the command ls() will list all of the objects. For this reason, it is wise to use different directories for different projects. To remove an object that has been loaded, you can use the command rm(objectname), while rm(list=ls()) will remove all objects in memory.

2.3. Importing data from other formats. The foreign package allows you to read data created by different popular programs. To load it, simply type library(foreign) from within R. Supported formats include

read.arff: Read Data from ARFF Files
read.dbf: Read a DBF File
read.dta: Read Stata Binary Files
read.epiinfo: Read Epi Info Data Files
read.mtp: Read a Minitab Portable Worksheet
read.octave: Read Octave Text Data Files
read.S: Read an S3 Binary or data.dump File
read.spss: Read an SPSS Data File
read.ssd: Obtain a Data Frame from a SAS Permanent Dataset, via read.xport
read.systat: Obtain a Data Frame from a Systat File
read.xport: Read a SAS XPORT Format Library

The following code snippet reads the Stata file ‘wage1.dta’ (Wooldridge (2002)) and lists the names of variables in the data frame.
R> library(foreign)
R> mydat <- read.dta(file="wage1.dta")
R> names(mydat)
 [1] "wage"     "educ"     "exper"    "tenure"   "nonwhite" "female"
 [7] "married"  "numdep"   "smsa"     "northcen" "south"    "west"
[13] "construc" "ndurman"  "trcommpu" "trade"    "services" "profserv"
[19] "profocc"  "clerocc"  "servocc"  "lwage"    "expersq"  "tenursq"

Clearly R makes it simple to migrate data from one environment to another.

Having installed R and having read in data from a text file or supported format such as a Stata binary file, you can then install packages via the install.packages() command, as in install.packages("np") which will install the np package (Hayfield & Racine (2008)).

3. Some Nonparametric and Semiparametric Routines Available in R

Table 1 summarizes some of the nonparametric and semiparametric routines available to users of R. As can be seen, there appears to be a rich range of nonparametric implementations available to the practitioner. However, upon closer inspection many are limited in one way or another in ways that might frustrate applied econometricians. For instance, some nonparametric regression methods admit only one regressor, while others admit only numerical data types and cannot admit categorical data that is often found in applied settings. Table 1 is not intended to be exhaustive, rather, it ought to serve to orient the reader to a subset of the rich array of nonparametric methods that currently exist in the R environment. To see a routine in action, you can type example("funcname",package="pkgname") where funcname is the name of a routine and pkgname is the associated package and this will run an example contained in the help file for that function. For instance, example("npreg",package="np") will run a kernel regression example from the package np.

Table 1. An illustrative summary of R packages that implement nonparametric methods.
Each package name below is followed by its functions, listed as Function: Description.

ash
    ash1: Computes univariate averaged shifted histograms
    ash2: Computes bivariate averaged shifted histograms

car
    n.bins: Computes number of bins for histograms with different rules

gam
    gam: Computes generalized additive models using the method described in Hastie & Tibshirani (1990)

GenKern
    KernSec: Computes univariate kernel density estimates
    KernSur: Computes bivariate kernel density estimates

Graphics (base)
    boxplot: Produces box-and-whisker plot(s)
    nclass.Sturges: Computes the number of classes for a histogram
    nclass.scott: Computes the number of classes for a histogram
    nclass.FD: Computes the number of classes for a histogram

KernSmooth
    bkde: Computes a univariate binned kernel density estimate using the fast Fourier transform as described in Silverman (1982)
    bkde2D: Computes a bivariate binned kernel density estimate as described in Wand (1994)
    dpik: Computes a bandwidth for a univariate kernel density estimate using the method described in Sheather & Jones (1991)
    dpill: Computes a bandwidth for univariate local linear regression using the method described in Ruppert, Sheather & Wand (1995)
    locpoly: Computes a univariate probability density function, bivariate regression function or their derivatives using local polynomials

ks
    kde: Computes a multivariate kernel density estimate for 1- to 6-dimensional numerical data

locfit
    locfit: Computes univariate local regression and likelihood models
    sjpi: Computes a bandwidth via the plug-in Sheather & Jones (1991) method
    kdeb: Computes univariate kernel density estimate bandwidths

MASS
    bandwidth.nrd: Computes Silverman’s rule-of-thumb for choosing the bandwidth of a univariate Gaussian kernel density estimator
    hist.scott: Plots a histogram with automatic bin width selection (Scott)
    hist.FD: Plots a histogram with automatic bin width selection (Freedman-Diaconis)
    kde2d: Computes a bivariate kernel density estimate
    width.SJ: Computes the Sheather & Jones (1991) bandwidth for a univariate Gaussian kernel density estimator
    bcv: Computes biased cross-validation bandwidth selection for a univariate Gaussian kernel density estimator
    ucv: Computes unbiased cross-validation bandwidth selection for a univariate Gaussian kernel density estimator

np
    npcdens: Computes a multivariate conditional density as described in Hall, Racine & Li (2004)
    npcdist: Computes a multivariate conditional distribution as described in Li & Racine (forthcoming)
    npcmstest: Conducts a parametric model specification test as described in Hsiao, Li & Racine (2007)
    npconmode: Conducts multivariate modal regression
    npindex: Computes a multivariate single index model as described in Ichimura (1993), Klein & Spady (1993)
    npksum: Computes multivariate kernel sums with numeric and categorical data types
    npplot: Conducts general purpose plotting of nonparametric objects
    npplreg: Computes a multivariate partially linear model as described in Robinson (1988), Racine & Liu (2007)
    npqcmstest: Conducts a parametric quantile regression model specification test as described in Zheng (1998), Racine (2006)
    npqreg: Computes multivariate quantile regression as described in Li & Racine (forthcoming)
    npreg: Computes multivariate regression as described in Racine & Li (2004), Li & Racine (2004)
    npscoef: Computes multivariate smooth coefficient models as described in Li & Racine (2007b)
    npsigtest: Computes the significance test as described in Racine (1997), Racine, Hart & Li (2006)
    npudens: Computes multivariate density estimation as described in Parzen (1962), Rosenblatt (1956), Li & Racine (2003)
    npudist: Computes multivariate distribution functions as described in Parzen (1962), Rosenblatt (1956), Li & Racine (2003)

stats (base)
    bw.nrd: Univariate bandwidth selectors for Gaussian windows in density
    density: Computes a univariate kernel density estimate
    hist: Computes a univariate histogram
    smooth.spline: Computes a univariate cubic smoothing spline as described in Chambers & Hastie (1991)
    ksmooth: Computes a univariate Nadaraya-Watson kernel regression estimate described in Wand & Jones (1995)
    loess: Computes a smooth curve fitted by the loess method described in Cleveland, Grosse & Shyu (1992) (1-4 numeric predictors)

3.1. Nonparametric Density Estimation in R. Univariate density estimation is one of the most popular exploratory nonparametric methods in use today. Readers will no doubt be intimately familiar with two popular nonparametric estimators, namely the univariate histogram and kernel estimators. For an in-depth treatment of kernel density estimation we direct the interested reader to the wonderful monographs by Silverman (1986) and Scott (1992), while for mixed data density estimation we direct the reader to Li & Racine (2003) and the references therein. We shall begin with an illustrative parametric example.

Consider any random variable X having probability density function f(x), and let f(·) be the object of interest. Suppose one is presented with a series of independent and identically distributed draws from the unknown distribution and asked to model the density of the data, f(x).

For this example we shall simulate n = 500 draws (250 from each mixture component) but immediately discard knowledge of the true data generating process (DGP), pretending that we are unaware that the data is drawn from a mixture of normals (N(−2, 0.25) and N(3, 2.25) with equal probability). The following code snippet demonstrates one way to draw random samples from a mixture of normals.

R> library(np)
Nonparametric Kernel Methods for Mixed Datatypes (version 0.20-3)
R> set.seed(123)
R> n <- 250
R> x <- sort(c(rnorm(n,mean=-2,sd=0.5),rnorm(n,mean=3,sd=1.5)))

The following figure plots the true DGP evaluated on an equally spaced grid of 1,000 points.

R> x.seq <- seq(-5,9,length=1000)
R> plot(x.seq,0.5*dnorm(x.seq,mean=-2,sd=0.5)+0.5*dnorm(x.seq,mean=3,sd=1.5),
+       xlab="X",
+       ylab="Mixture of Normal Densities",
+       type="l",
+       main="",
+       col="blue",
+       lty=1)

[Figure: the true mixture of normal densities plotted against X.]

Suppose one naïvely presumed that the data is drawn from, say, the normal parametric family (not a mixture thereof), then tested this assumption using the Shapiro-Wilk test. The following code snippet demonstrates how this is done in R.

R> shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.87, p-value < 2.2e-16

Given that this popular parametric model is flatly rejected by this dataset, we have two choices, namely 1) search for a more appropriate parametric model or 2) use more flexible estimators.

For what follows, we shall presume that the reader has found themselves in just such a situation. That is, they have faithfully applied a parametric method and conducted a series of tests of model adequacy that indicate that the parametric model is not consistent with the underlying DGP. They then turn to more flexible methods of density estimation. Note that though we are considering density estimation at the moment, it could be virtually any parametric approach that we have been discussing, for instance, regression analysis and so forth.

If one wished to examine a histogram one could use the following code snippet,

R> hist(x,prob=TRUE,main="")

[Figure: histogram of x; vertical axis: Density.]

Of course, though consistent, the histogram suffers from a number of drawbacks, hence one might instead consider a smooth nonparametric density estimator such as the univariate Parzen kernel estimator (Parzen (1962)). A univariate kernel estimator can be obtained using the density command that is part of R base. This function supports a range of bandwidth methods (see ?bw.nrd for details) and kernels (see ?density for details). The default bandwidth method is Silverman’s ‘rule of thumb’ (Silverman (1986, page 48, eqn (3.31))), and for this data we obtain the following:

R> plot(density(x),main="")

[Figure: kernel density estimate of x; N = 500, Bandwidth = 0.7256; vertical axis: Density.]

The density function in R has a number of virtues. It is extremely fast computationally speaking: the algorithm disperses the mass of the empirical distribution function over a regular grid, uses the fast Fourier transform to convolve this approximation with a discretized version of the kernel, and then uses a linear approximation to evaluate the density at the specified points. If one wishes to obtain a univariate kernel estimate for a large sample of data then this is definitely the function of choice. However, for a bivariate (or higher dimensional) density estimate one would requ
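As a minimal sketch of the default bandwidth rule mentioned above for density, the code below computes Silverman's rule of thumb by hand, h = 0.9 min(sd, IQR/1.34) n^(-1/5), and compares it with the value returned by bw.nrd0 (the selector documented under ?bw.nrd) and with the bandwidth stored in the fitted density object. The variable names A, h.rot and fit are introduced here for illustration only, and the data are regenerated with the same seed as in the text so that the block is self-contained.

```r
# Regenerate the mixture-of-normals sample used in the text.
set.seed(123)
n <- 250
x <- sort(c(rnorm(n, mean = -2, sd = 0.5), rnorm(n, mean = 3, sd = 1.5)))

# Silverman's rule of thumb: 0.9 times the smaller of the sample
# standard deviation and IQR/1.34, times (sample size)^(-1/5).
A <- min(sd(x), IQR(x) / 1.34)
h.rot <- 0.9 * A * length(x)^(-1/5)

# density() uses bw = "nrd0" by default, so all three should agree.
fit <- density(x)
print(c(by.hand = h.rot, bw.nrd0 = bw.nrd0(x), density = fit$bw))
```

Passing a different selector, for example density(x, bw = "SJ"), replaces the rule of thumb with the Sheather & Jones (1991) bandwidth listed in Table 1.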
