NONPARAMETRIC AND SEMIPARAMETRIC METHODS IN R
JEFFREY S. RACINE
Abstract. The R environment for statistical computing and graphics (R Development Core Team
(2008)) offers practitioners a rich set of statistical methods ranging from random number generation
and optimization methods through regression, panel data, and time series methods, by way of
illustration. The standard R distribution (‘base R’) comes preloaded with a rich variety of functionality
useful for applied econometricians. This functionality is enhanced by user supplied packages
made available via R servers that are mirrored around the world. Of interest in this chapter are
methods for estimating nonparametric and semiparametric models. We summarize many of the
facilities in R and consider some tools that might be of interest to those wishing to work with
nonparametric methods who want to avoid resorting to programming in C or Fortran but need the
speed of compiled code as opposed to interpreted code such as Gauss or Matlab by way of example.
We encourage those working in the field to strongly consider implementing their methods in the
R environment thereby making their work accessible to the widest possible audience via an open
collaborative forum.
1. Introduction
Unlike their more established parametric counterparts, many nonparametric and semiparametric
methods that have received widespread theoretical treatment have not yet found their way into
mainstream commercial packages. This has hindered their adoption by applied researchers, and it
is safe to describe the availability of modern nonparametric methods as fragmented at best, which
can be frustrating for users who wish to assess whether or not such methods can add value to their
application. Thus, one frequently heard complaint about the state of nonparametric kernel methods
concerns the lack of software along with the fact that implementations in interpreted environments
such as Gauss are orders of magnitude slower than compiled implementations written in C or
Fortran. Though many researchers may code their methods, often using interpreted environments
such as Gauss, it is fair to characterize much of this code as neither designed nor suited as tools
for general purpose use as they are typically written solely to demonstrate ‘proof of concept’. Even
though many authors are more than happy to circulate such code (which is of course appreciated!),
this often imposes certain hardships on the user including 1) having to purchase a (closed and
proprietary) commercial software package and 2) having to modify the code substantially in order
to use it for their application.
The R environment for statistical computing and graphics (R Development Core Team (2008))
offers practitioners a range of tools for estimating nonparametric, semiparametric, and of course
parametric models. Unlike many commercial programs, which must first be purchased in order to
evaluate them, you can adopt R with minimal effort and with no financial outlay required. Many
Date: November 14, 2008.
nonparametric methods are well documented, tested, and are suitable for general use via a common
interface structure (such as the ‘formula’ interface) making it easy for users familiar with R to deploy
these tools for their particular application. Furthermore, one of the strengths of R is the ability
to call compiled C or Fortran code via a common interface structure thereby delivering the speed
of compiled code in a flexible easy to use environment. In addition, there exist a number of R
‘packages’ (often called ‘libraries’ or ‘modules’ in other environments) that implement a variety of
kernel methods, albeit with varying degrees of functionality (e.g., univariate versus multivariate,
the ability/inability to handle numerical and categorical data and so forth). Finally, R delivers a
rich framework for implementing and making code available to the community.
In this chapter we outline many of the functions and packages available in R that might be
of interest to practitioners, and consider some illustrative applications along with code fragments
that might be of interest. Before proceeding further, we first begin with an introduction to the R
environment itself.
2. The R Environment
What is R? Perhaps it is best to begin with the question “what is S”? S is a language and
environment designed for statistical computing and graphics which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies). S has grown to become the de facto standard among
econometricians and statisticians, and there are two main implementations, the commercial
implementation called ‘S-PLUS’, and the free, open-source implementation called ‘R’. R delivers a rich
array of statistical methods, and one of its strengths is the ease with which ‘packages’ can be
developed and made available to users for free. R is a mature open platform that is ideally suited to
the task of making one’s method available to the widest possible user base free of charge.
In this section we briefly describe a handful of resources available to those interested in using
R, introduce the user to the R environment, and introduce the user to the foreign package that
facilitates importation of data from packages such as SAS, SPSS, Stata, and Minitab, among others.
2.1. Web sites. A number of sites are devoted to helping R users, and we briefly mention a few of
them below.
http://www.R-project.org/: This is the R home page from which you can download the
program itself and many R packages. There are also manuals, other links, and facilities for
joining various R mailing lists.
http://CRAN.R-project.org/: This is the ‘Comprehensive R Archive Network,’ “a network
of ftp and web servers around the world that store identical, up-to-date, versions of
code and documentation for the R statistical package.” Packages are only put on CRAN
when they pass a rather stringent collection of quality assurance checks, and in particular
are guaranteed to build and run on standard platforms.
http://cran.r-project.org/web/views/Econometrics.html: This is the CRAN ‘task
view’ for computational econometrics. “Base R ships with a lot of functionality useful
for computational econometrics, in particular in the stats package. This functionality is
complemented by many packages on CRAN, a brief overview is given below.” This provides
an excellent summary of both parametric and nonparametric packages that exist for the R
environment.
http://pj.freefaculty.org/R/Rtips.html: This site provides a large and excellent collection
of R tips.
2.2. Getting started with R. A number of well written manuals exist for R and can be located
at the R web site. This section is clearly not intended to be a substitute for these resources. It
simply provides a minimal set of commands which will aid those who have never used R before.
Having installed and run R, you will find yourself at the > prompt. To quit the program, simply
type q(). To get help, you can either enter a command preceded by a question mark, as in ?help,
or type help.start() at the > prompt. The latter will spawn your web browser (it reads files
from your hard drive, so you do not have to be connected to the Internet to use this feature).
You can enter commands interactively at the R prompt, or you can create a text file containing
the commands and execute all commands in the file from the R prompt by typing
source("commands.R"), where commands.R is the text file containing your commands.
Many editors recognize the .R extension, providing a useful interface for the development of R
code. For example, GNU Emacs is a powerful editor that works well with R and also LaTeX
(http://www.gnu.org/software/emacs/emacs.html).
When you quit by entering the q() command, you will be asked whether or not you wish to save
the current session. If you enter Y, then the next time you run R in the same directory it will load all
of the objects created in the previous session. If you do so, typing the command ls() will list all of
the objects. For this reason, it is wise to use different directories for different projects. To remove
objects that have been loaded, you can use the command rm(objectname), while rm(list=ls()) will
remove all objects in memory.
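As a minimal sketch of the commands just described, the following writes a two-line script to a temporary file, runs it with source(), and then lists and removes the objects it created. The object names a and b, the temporary file, and the scratch environment are assumptions made purely for illustration; the scratch environment is used so that the rm() calls cannot touch anything in your own workspace.

```r
# Write two commands to a temporary script file (a stand-in for commands.R).
script <- tempfile(fileext = ".R")
writeLines(c("a <- 1:10", "b <- mean(a)"), script)

# Run the script in a scratch environment rather than the global workspace,
# so that the rm() calls below cannot remove any of your own objects.
scratch <- new.env()
source(script, local = scratch)

# List the objects the script created, then remove one, then all of them.
objects.before <- ls(envir = scratch)
rm("a", envir = scratch)
objects.after.one <- ls(envir = scratch)
rm(list = ls(envir = scratch), envir = scratch)
objects.after.all <- ls(envir = scratch)

print(objects.before)
print(objects.after.one)
print(objects.after.all)
```

At the interactive prompt one would normally omit the envir argument, in which case ls() and rm() act on the current workspace.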
2.3. Importing data from other formats. The foreign package allows you to read data created
by different popular programs. To load it, simply type library(foreign) from within R. Supported
formats include
read.arff: Read Data from ARFF Files
read.dbf: Read a DBF File
read.dta: Read Stata Binary Files
read.epiinfo: Read Epi Info Data Files
read.mtp: Read a Minitab Portable Worksheet
read.octave: Read Octave Text Data Files
read.S: Read an S3 Binary or data.dump File
read.spss: Read an SPSS Data File
read.ssd: Obtain a Data Frame from a SAS Permanent Dataset, via read.xport
read.systat: Obtain a Data Frame from a Systat File
read.xport: Read a SAS XPORT Format Library
The following code snippet reads the Stata file ‘wage1.dta’ (Wooldridge (2002)) and lists the
names of variables in the data frame.
R> library(foreign)
R> mydat <- read.dta(file="wage1.dta")
R> names(mydat)
[1] "wage" "educ" "exper" "tenure" "nonwhite" "female"
[7] "married" "numdep" "smsa" "northcen" "south" "west"
[13] "construc" "ndurman" "trcommpu" "trade" "services" "profserv"
[19] "profocc" "clerocc" "servocc" "lwage" "expersq" "tenursq"
Clearly R makes it simple to migrate data from one environment to another.
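For readers who do not have ‘wage1.dta’ at hand, the round trip can be sketched with a small made-up data frame. The data frame toy and its values are hypothetical and are not taken from the Wooldridge data; write.dta() and read.dta() are both part of the foreign package.

```r
library(foreign)

# A small hypothetical data frame standing in for a real dataset.
toy <- data.frame(wage   = c(3.10, 3.24, 6.00),
                  educ   = c(11L, 12L, 16L),
                  female = c(1L, 1L, 0L))

# Write it out as a Stata binary file, then read it back in.
stata.file <- tempfile(fileext = ".dta")
write.dta(toy, file = stata.file)
toy.back <- read.dta(file = stata.file)

# The variable names and values survive the round trip.
print(names(toy.back))
print(toy.back)
```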
Having installed R and having read in data from a text file or supported format such as a
Stata binary file, you can then install packages via the install.packages() command, as in
install.packages("np") which will install the np package (Hayfield & Racine (2008)).
3. Some Nonparametric and Semiparametric Routines Available in R
Table 1 summarizes some of the nonparametric and semiparametric routines available to users
of R. As can be seen, there appears to be a rich range of nonparametric implementations available
to the practitioner. However, upon closer inspection many are limited in one way or another
in ways that might frustrate applied econometricians. For instance, some nonparametric regression
methods admit only one regressor, while others admit only numerical data types and cannot
admit categorical data that is often found in applied settings. Table 1 is not intended to be
exhaustive; rather, it ought to serve to orient the reader to a subset of the rich array of
nonparametric methods that currently exist in the R environment. To see a routine in action, you can
type example("funcname",package="pkgname") where funcname is the name of a routine and
pkgname is the associated package, and this will run an example contained in the help file for that
function. For instance, example("npreg",package="np") will run a kernel regression example
from the package np.
Table 1. An illustrative summary of R packages that implement nonparametric methods.

Package          Function        Description
ash              ash1            Computes univariate averaged shifted histograms
                 ash2            Computes bivariate averaged shifted histograms
car              n.bins          Computes number of bins for histograms with different rules
gam              gam             Computes generalized additive models using the method described in
                                 Hastie & Tibshirani (1990)
GenKern          KernSec         Computes univariate kernel density estimates
                 KernSur         Computes bivariate kernel density estimates
Graphics (base)  boxplot         Produces box-and-whisker plot(s)
                 nclass.Sturges  Computes the number of classes for a histogram
                 nclass.scott    Computes the number of classes for a histogram
                 nclass.FD       Computes the number of classes for a histogram
KernSmooth       bkde            Computes a univariate binned kernel density estimate using the fast
                                 Fourier transform as described in Silverman (1982)
                 bkde2D          Computes a bivariate binned kernel density estimate as described in
                                 Wand (1994)
                 dpik            Computes a bandwidth for a univariate kernel density estimate using
                                 the method described in Sheather & Jones (1991)
                 dpill           Computes a bandwidth for univariate local linear regression using the
                                 method described in Ruppert, Sheather & Wand (1995)
                 locpoly         Computes a univariate probability density function, bivariate
                                 regression function or their derivatives using local polynomials
ks               kde             Computes a multivariate kernel density estimate for 1- to
                                 6-dimensional numerical data
locfit           locfit          Computes univariate local regression and likelihood models
                 sjpi            Computes a bandwidth via the plug-in Sheather & Jones (1991) method
                 kdeb            Computes univariate kernel density estimate bandwidths
MASS             bandwidth.nrd   Computes Silverman’s rule-of-thumb for choosing the bandwidth of a
                                 univariate Gaussian kernel density estimator
                 hist.scott      Plots a histogram with automatic bin width selection (Scott)
                 hist.FD         Plots a histogram with automatic bin width selection
                                 (Freedman-Diaconis)
                 kde2d           Computes a bivariate kernel density estimate
                 width.SJ        Computes the Sheather & Jones (1991) bandwidth for a univariate
                                 Gaussian kernel density estimator
                 bcv             Computes biased cross-validation bandwidth selection for a univariate
                                 Gaussian kernel density estimator
                 ucv             Computes unbiased cross-validation bandwidth selection for a
                                 univariate Gaussian kernel density estimator
np               npcdens         Computes a multivariate conditional density as described in Hall,
                                 Racine & Li (2004)
                 npcdist         Computes a multivariate conditional distribution as described in
                                 Li & Racine (forthcoming)
                 npcmstest       Conducts a parametric model specification test as described in Hsiao,
                                 Li & Racine (2007)
                 npconmode       Conducts multivariate modal regression
                 npindex         Computes a multivariate single index model as described in Ichimura
                                 (1993), Klein & Spady (1993)
                 npksum          Computes multivariate kernel sums with numeric and categorical data
                                 types
                 npplot          Conducts general purpose plotting of nonparametric objects
                 npplreg         Computes a multivariate partially linear model as described in
                                 Robinson (1988), Racine & Liu (2007)
                 npqcmstest      Conducts a parametric quantile regression model specification test as
                                 described in Zheng (1998), Racine (2006)
                 npqreg          Computes multivariate quantile regression as described in Li & Racine
                                 (forthcoming)
                 npreg           Computes multivariate regression as described in Racine & Li (2004),
                                 Li & Racine (2004)
                 npscoef         Computes multivariate smooth coefficient models as described in
                                 Li & Racine (2007b)
                 npsigtest       Computes the significance test as described in Racine (1997), Racine,
                                 Hart & Li (2006)
                 npudens         Computes multivariate density estimation as described in Parzen
                                 (1962), Rosenblatt (1956), Li & Racine (2003)
                 npudist         Computes multivariate distribution functions as described in Parzen
                                 (1962), Rosenblatt (1956), Li & Racine (2003)
stats (base)     bw.nrd          Univariate bandwidth selectors for Gaussian windows in density
                 density         Computes a univariate kernel density estimate
                 hist            Computes a univariate histogram
                 smooth.spline   Computes a univariate cubic smoothing spline as described in
                                 Chambers & Hastie (1991)
                 ksmooth         Computes a univariate Nadaraya-Watson kernel regression estimate
                                 described in Wand & Jones (1995)
                 loess           Computes a smooth curve fitted by the loess method described in
                                 Cleveland, Grosse & Shyu (1992) (1-4 numeric predictors)
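To give a flavour of how routines from different packages in Table 1 can be compared, the following sketch computes several univariate bandwidths for the same data. The simulated vector z is an assumption made for illustration only. The functions bw.nrd0, bw.nrd and bw.SJ come from the stats package and dpik from KernSmooth, which ships with the standard R distribution as a recommended package.

```r
library(KernSmooth)  # provides dpik

set.seed(42)
z <- rnorm(200)      # hypothetical data, used only for illustration

# Four data-driven bandwidths for a univariate Gaussian kernel estimator.
bw <- c(nrd0 = bw.nrd0(z),  # Silverman's rule of thumb (stats)
        nrd  = bw.nrd(z),   # Scott's variation on the rule of thumb (stats)
        SJ   = bw.SJ(z),    # Sheather & Jones (1991) selector (stats)
        dpik = dpik(z))     # direct plug-in selector (KernSmooth)
print(round(bw, 3))
```

Note that some selectors in Table 1 report the bandwidth on a different scale (for example, as a kernel width rather than a standard deviation), so the help page for each routine should be consulted before comparing values across packages.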
3.1. Nonparametric Density Estimation in R. Univariate density estimation is one of the most
popular exploratory nonparametric methods in use today. Readers will no doubt be intimately
familiar with two popular nonparametric estimators, namely the univariate histogram and kernel
estimators. For an in-depth treatment of kernel density estimation we direct the interested reader
to the wonderful monographs by Silverman (1986) and Scott (1992), while for mixed data density
estimation we direct the reader to Li & Racine (2003) and the references therein. We shall begin
with an illustrative parametric example.
Consider any random variable X having probability density function f(x), and let f(·) be the
object of interest. Suppose one is presented with a series of independent and identically distributed
draws from the unknown distribution and asked to model the density of the data, f(x).
For this example we shall simulate n = 500 draws (250 from each component) but immediately discard
knowledge of the true data generating process (DGP), pretending that we are unaware that the data is
drawn from a mixture of normals (N(−2, 0.25) and N(3, 2.25) with equal probability). The following code
snippet demonstrates one way to draw random samples from a mixture of normals.
R> library(np)
Nonparametric Kernel Methods for Mixed Datatypes (version 0.20-3)
R> set.seed(123)
R> n <- 250
R> x <- sort(c(rnorm(n,mean=-2,sd=0.5),rnorm(n,mean=3,sd=1.5)))
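The snippet above takes exactly 250 draws from each component. An alternative, sketched below under the assumption that one wants component membership itself to be random with probability 1/2, first draws a label for each observation and then draws from the corresponding normal. The names n.total, component and x.random are introduced here for illustration and are not used elsewhere in this chapter.

```r
set.seed(123)
n.total <- 500

# Draw a component label for each observation with probability 1/2 each,
# so the number of draws from each component is itself random.
component <- rbinom(n.total, size = 1, prob = 0.5)

# Select the mean and standard deviation that go with each label:
# label 1 is N(3, 2.25) and label 0 is N(-2, 0.25) (variances quoted).
mu    <- ifelse(component == 1, 3, -2)
sigma <- ifelse(component == 1, 1.5, 0.5)
x.random <- sort(rnorm(n.total, mean = mu, sd = sigma))

# How many draws came from each component in this particular sample.
print(table(component))
```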
The following figure plots the true DGP evaluated on an equally spaced grid of 1,000 points.
R> x.seq <- seq(-5,9,length=1000)
R> plot(x.seq,0.5*dnorm(x.seq,mean=-2,sd=0.5)+0.5*dnorm(x.seq,mean=3,sd=1.5),
+ xlab="X",
+ ylab="Mixture of Normal Densities",
+ type="l",
+ main="",
+ col="blue",
+ lty=1)
[Figure: the true mixture of normal densities. Horizontal axis: X; vertical axis: Mixture of Normal Densities.]
Suppose one naïvely presumed that the data is drawn from, say, the normal parametric family
(not a mixture thereof), then tested this assumption using the Shapiro-Wilk test. The following
code snippet demonstrates how this is done in R.
R> shapiro.test(x)
Shapiro-Wilk normality test
data: x
W = 0.87, p-value < 2.2e-16
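The object returned by shapiro.test can also be stored and queried, which is convenient when the decision is to be used later in a script. The sketch below regenerates the data so that it is self-contained; the 0.05 significance level and the names sw, W, p and reject.normality are assumptions made for illustration.

```r
set.seed(123)
n <- 250
x <- sort(c(rnorm(n, mean = -2, sd = 0.5), rnorm(n, mean = 3, sd = 1.5)))

sw <- shapiro.test(x)

# The result is a list of class "htest"; its components can be used directly.
W <- unname(sw$statistic)
p <- sw$p.value
reject.normality <- p < 0.05   # 0.05 is an illustrative significance level

cat("W =", round(W, 3), " reject normality:", reject.normality, "\n")
```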
Given that this popular parametric model is flatly rejected by this dataset, we have two choices,
namely 1) search for a more appropriate parametric model or 2) use more flexible estimators.
For what follows, we shall presume that the reader has found themselves in just such a situation.
That is, they have faithfully applied a parametric method and conducted a series of tests of model
adequacy that indicate that the parametric model is not consistent with the underlying DGP. They
then turn to more flexible methods of density estimation. Note that though we are considering
density estimation at the moment, it could be virtually any parametric approach that we have been
discussing, for instance, regression analysis and so forth.
If one wished to examine a histogram one could use the following code snippet,
R> hist(x,prob=TRUE,main="")
[Figure: histogram of x. Horizontal axis: x; vertical axis: Density.]
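The appearance of a histogram depends on the number of bins used. As a rough sketch, the bin-count rules listed in Table 1 can be compared for the simulated data as follows. The data are regenerated so that the block is self-contained, and the names bins, h.sturges and h.fd are introduced only for illustration.

```r
set.seed(123)
n <- 250
x <- sort(c(rnorm(n, mean = -2, sd = 0.5), rnorm(n, mean = 3, sd = 1.5)))

# Suggested number of bins under three rules (all available in base R).
bins <- c(Sturges = nclass.Sturges(x),
          Scott   = nclass.scott(x),
          FD      = nclass.FD(x))
print(bins)

# hist() accepts a rule by name; plot = FALSE returns the counts only.
# hist() treats the number as a suggestion and chooses "pretty" breakpoints.
h.sturges <- hist(x, breaks = "Sturges", plot = FALSE)
h.fd      <- hist(x, breaks = "FD", plot = FALSE)
cat("bins used:", length(h.sturges$counts), "versus", length(h.fd$counts), "\n")
```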
Of course, though consistent, the histogram suffers from a number of drawbacks; hence one
might instead consider a smooth nonparametric density estimator such as the univariate Parzen
kernel estimator (Parzen (1962)). A univariate kernel estimator can be obtained using the density
command that is part of base R. This function supports a range of bandwidth methods (see ?bw.nrd
for details) and kernels (see ?density for details). The default bandwidth method is Silverman’s
‘rule of thumb’ (Silverman (1986, page 48, eqn (3.31))), and for this data we obtain the following:
R> plot(density(x),main="")
[Figure: kernel density estimate of x. N = 500, Bandwidth = 0.7256; vertical axis: Density.]
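To make the default bandwidth concrete, the following sketch computes Silverman's rule of thumb by hand and compares it with the bandwidth stored in the object returned by density. It also evaluates the Gaussian kernel estimator directly at a few points for comparison with the values interpolated from the density output. The evaluation points x0 and the names h.hand, f.direct and f.binned are assumptions made for illustration.

```r
set.seed(123)
n <- 250
x <- sort(c(rnorm(n, mean = -2, sd = 0.5), rnorm(n, mean = 3, sd = 1.5)))

fit <- density(x)   # defaults: Gaussian kernel, bw = "nrd0"

# Silverman's rule of thumb computed by hand.
h.hand <- 0.9 * min(sd(x), IQR(x) / 1.34) * length(x)^(-1/5)

# Direct evaluation of the Gaussian kernel estimator at a few points,
# f.hat(p) = mean(dnorm((p - x) / h)) / h, using the same bandwidth.
x0 <- c(-2, 0, 3)
f.direct <- sapply(x0, function(p) mean(dnorm((p - x) / fit$bw)) / fit$bw)

# The same points interpolated from the grid that density() returns.
f.binned <- approx(fit$x, fit$y, xout = x0)$y

print(c(bw.default = fit$bw, bw.by.hand = h.hand))
print(round(cbind(x0, f.direct, f.binned), 4))
```

The two sets of density values should agree closely but need not be identical, since density works from a gridded approximation to the data.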
The density function in R has a number of virtues. It is extremely fast computationally speaking,
as the algorithm first disperses the mass of the empirical distribution function over a regular grid,
then uses the fast Fourier transform to convolve this approximation with a discretized version of
the kernel, and finally uses a linear approximation to evaluate the density at the specified points. If
one wishes to obtain a univariate kernel estimate for a large sample of data then this is definitely
the function of choice. However, for a bivariate (or higher dimensional) density estimate one would
requ