npsigtest implements a consistent test of significance for one or more
explanatory variables in a nonparametric regression setting, analogous
to a simple \(t\)-test (\(F\)-test) in a parametric
regression setting. The test is based on Racine, Hart, and Li (2006)
and Racine (1997).
npsigtest(bws, ...)
# S3 method for class 'formula'
npsigtest(bws, data = NULL, ...)
# S3 method for class 'call'
npsigtest(bws, ...)
# S3 method for class 'npregression'
npsigtest(bws, ...)
# Default S3 method
npsigtest(bws, xdat, ydat, ...)
# S3 method for class 'rbandwidth'
npsigtest(bws,
xdat = stop("data xdat missing"),
ydat = stop("data ydat missing"),
boot.num = 399,
boot.method = c("iid","wild","wild-rademacher","pairwise"),
boot.type = c("I","II"),
pivot=TRUE,
joint=FALSE,
index = seq(1,ncol(xdat)),
random.seed = 42,
...)
bws
a bandwidth specification. This can be set as an rbandwidth
object returned from a previous invocation, or as a vector of
bandwidths, with each element \(i\) corresponding to the bandwidth
for column \(i\) in xdat. In either case, the bandwidth
supplied will serve as a starting point in the numerical search for
optimal bandwidths when using boot.type="II". If specified
as a vector, then additional arguments will need to be supplied as
necessary to specify the bandwidth type, kernel types, selection
methods, and so on.
data
an optional data frame, list or environment (or object coercible to
a data frame by as.data.frame) containing the
variables in the model. If not found in data, the variables are
taken from environment(bws), typically the environment from
which npregbw was called.
xdat
a \(p\)-variate data frame of explanatory data (training data) used to calculate the regression estimators.
ydat
a one (1) dimensional numeric or integer vector of dependent data,
each element \(i\) corresponding to each observation (row) \(i\)
of xdat.
boot.method
a character string used to specify the bootstrap method for
determining the null distribution. pairwise resamples
observation pairs, while the remaining methods use a residual bootstrap
procedure. iid will generate independent identically
distributed draws. wild will use a wild
bootstrap. wild-rademacher will use a wild bootstrap with
Rademacher variables. Defaults to iid.
boot.num
an integer value specifying the number of bootstrap replications to
use. Defaults to 399.
boot.type
a character string specifying whether to use a ‘Bootstrap I’ or
‘Bootstrap II’ method (see Racine, Hart, and Li (2006) for
details). The ‘Bootstrap II’ method re-runs cross-validation for
each bootstrap replication and uses the new cross-validated
bandwidth for variable \(i\) and the original ones for the
remaining variables. Defaults to boot.type="I".
pivot
a logical value which specifies whether to bootstrap a pivotal
statistic or not (pivoting is achieved by dividing gradient
estimates by their asymptotic standard errors). Defaults to
TRUE.
joint
a logical value which specifies whether to conduct a joint test or
individual test. This is to be used in conjunction with index
where index contains two or more integers corresponding to
the variables being tested, where the integers correspond to the
variables in the order in which they appear among the set of
explanatory variables in the function call to
npreg/npregbw. Defaults to FALSE.
index
a vector of indices for the columns of xdat for which the
test of significance is to be conducted. Defaults to
(1,2,...,\(p\)) where \(p\) is the number of columns in
xdat.
random.seed
an integer used to seed R's random number generator. This is to ensure replicability. Defaults to 42.
...
additional arguments supplied to specify the bandwidth type, kernel types, selection methods, and so on, detailed below.
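As a rough illustration of how joint and index interact, the sketch below checks an index vector before requesting a single joint test of the second and third regressors. The names check_joint_index and run_np_example are hypothetical and are not part of np. The npsigtest call uses only the arguments documented above, and it is skipped unless the flag is switched on and np is installed.

```r
# Hypothetical helper (not part of np): a joint test needs two or more
# distinct column positions, each between 1 and p.
check_joint_index <- function(index, p) {
  index <- as.integer(index)
  stopifnot(length(index) >= 2L,
            !anyDuplicated(index),
            all(index >= 1L),
            all(index <= p))
  index
}

idx <- check_joint_index(c(2, 3), p = 3)

# Bandwidth selection and the bootstrap are slow, so they run only on request.
run_np_example <- FALSE
if (run_np_example && requireNamespace("np", quietly = TRUE)) {
  library(np)
  set.seed(12345)
  n <- 100
  z <- rbinom(n, 1, .5)
  x1 <- rnorm(n)
  x2 <- runif(n, -2, 2)
  y <- x1 + x2 + rnorm(n)
  bw <- npregbw(formula = y ~ factor(z) + x1 + x2,
                regtype = "ll", bwmethod = "cv.aic")
  # One joint test of x1 and x2 (positions 2 and 3 in the formula)
  joint.test <- npsigtest(bws = bw, joint = TRUE, index = idx, boot.num = 99)
  summary(joint.test)
}
```

With joint=FALSE the same index would instead request two separate tests, one per listed variable.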
npsigtest implements a variety of methods for computing the
null distribution of the test statistic and allows the user to
investigate the impact of a variety of default settings including
whether or not to pivot the statistic (pivot), whether pairwise
or residual resampling is to be used (boot.method), and whether
or not to recompute the bandwidths for the variables being tested
(boot.type), among others.
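To make the difference between the resampling schemes concrete, here is a minimal base R sketch of how one bootstrap draw might be formed under each boot.method. It is not the package's internal code: fit.null and res are stand-ins for fitted values and residuals under the null, and the two-point weights used for "wild" (Mammen's choice) are an assumption made for illustration.

```r
set.seed(42)
n <- 8
fit.null <- rep(1, n)          # stand-in for fitted values under the null
res <- rnorm(n)
res <- res - mean(res)         # centred stand-in residuals

# "iid": resample the residuals with replacement
e.iid <- sample(res, n, replace = TRUE)

# "wild-rademacher": keep each residual in place, flip its sign with prob. 1/2
e.rad <- res * sample(c(-1, 1), n, replace = TRUE)

# "wild": two-point weights with mean 0 and variance 1 (Mammen's choice,
# assumed here for illustration)
a <- -(sqrt(5) - 1) / 2
b <- (sqrt(5) + 1) / 2
p.a <- (sqrt(5) + 1) / (2 * sqrt(5))
e.wild <- res * sample(c(a, b), n, replace = TRUE, prob = c(p.a, 1 - p.a))

# Residual-based bootstrap responses
y.iid <- fit.null + e.iid
y.rad <- fit.null + e.rad
y.wild <- fit.null + e.wild

# "pairwise": resample whole rows so each y stays with its own x
dat <- data.frame(x = runif(n), y = fit.null + res)
dat.pair <- dat[sample(n, n, replace = TRUE), ]
```

The wild variants leave each residual attached to its own observation, which is why they are often preferred when the error variance may depend on the regressors.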
Defaults are chosen so as to provide reasonable behaviour in a broad
range of settings and this involves a trade-off between computational
expense and finite-sample performance. However, the default
boot.type="I", though computationally expedient, can deliver a
test that is slightly over-sized in small-sample settings (e.g.
at the 5% level the test might reject 8% of the time for samples of
size \(n=100\) for some data generating processes). If the default
setting (boot.type="I") delivers a P-value that is in the
neighborhood (i.e. slightly smaller) of any classical level
(e.g. 0.05) and you only have a modest amount of data, it might be
prudent to re-run the test using the more computationally intensive
boot.type="II" setting to confirm the original result. Note
also that boot.method="pairwise" is not recommended for the
multivariate local linear estimator due to substantial size
distortions that may arise in certain cases.
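The advice above can be turned into a simple screening rule. The sketch below flags P-values that fall just under a chosen level. The function needs_recheck and its margin argument are hypothetical, and the width of the "neighborhood" is a judgment call rather than something the package prescribes.

```r
# Hypothetical helper: TRUE where a P-value is slightly smaller than the level
needs_recheck <- function(p, level = 0.05, margin = 0.02) {
  p < level & p >= level - margin
}

# Illustrative P-values, as might come from a boot.type = "I" run
p.type1 <- c(0.41, 0.038, 0.001)
flag <- needs_recheck(p.type1)
which(flag)

# With an np bandwidth object bw in hand (assumed, not created here), the
# flagged variables could then be re-tested with the costlier method:
#   npsigtest(bws = bw, index = which(flag), boot.type = "II")
```

Only the second value is flagged here: the first is clearly insignificant and the third is clearly significant, so neither conclusion is likely to change under boot.type="II".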
npsigtest returns an object of type sigtest, which is supported by
summary. The object has the following components:
In
the vector of statistics In
P
the vector of P-values for each statistic in In
In.bootstrap
contains a matrix of the bootstrap
replications of the vector In, each column corresponding to
replications associated with explanatory variables in xdat
indexed by index (e.g., if you selected index = c(1,4)
then In.bootstrap will have two columns, the first being the
bootstrap replications of In associated with variable
1, the second with variable 4).
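To show how the returned components relate to one another, the following base R sketch computes a bootstrap P-value for each column of a simulated In.bootstrap matrix as the share of replications exceeding the observed statistic. The chi-square draws are placeholders for real replications, and the package may use a slightly different convention (for example a non-strict inequality), so treat this as an approximation of the idea rather than the exact calculation.

```r
set.seed(42)
boot.num <- 399

# Placeholder observed statistics for two tested variables
In <- c(0.2, 3.1)

# Placeholder bootstrap replications: one column per tested variable
In.bootstrap <- cbind(rchisq(boot.num, df = 1),
                      rchisq(boot.num, df = 1))

# Share of replications larger than the observed statistic, column by column
P <- sapply(seq_along(In), function(j) mean(In.bootstrap[, j] > In[j]))
P
```

A large observed statistic relative to its bootstrap column yields a small P-value, which is evidence against the null that the variable is irrelevant.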
Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method,” Biometrika, 63, 413-420.
Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.
Racine, J.S., J. Hart, and Q. Li (2006), “Testing the significance of categorical predictor variables in nonparametric regression models,” Econometric Reviews, 25, 523-544.
Racine, J.S. (1997), “Consistent significance testing for nonparametric regression,” Journal of Business and Economic Statistics, 15, 369-379.
Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions,” Biometrika, 68, 301-309.
If you are using data of mixed types, then it is advisable to use the
data.frame function to construct your input data and not
cbind, since cbind will typically not work
as intended on mixed data types and will coerce the data to the same
type.
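The coercion problem is easy to see in base R. In the short sketch below (the variable names are illustrative only), cbind silently turns a factor into its integer codes, or turns everything into character strings, whereas data.frame keeps each column's type.

```r
z <- factor(c("a", "b", "a"))
x <- c(0.5, 1.2, -0.3)

# cbind builds a matrix, so all columns must share one type:
m.codes <- cbind(z, x)               # factor replaced by its integer codes
m.char <- cbind(as.character(z), x)  # numbers turned into character strings

# data.frame keeps the factor as a factor and the numbers as numeric
d <- data.frame(z, x)
sapply(d, class)
```

Because the bandwidth routines choose kernels according to each column's type, losing the factor type in this way would change how the variable is treated.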
Caution: bootstrap methods are, by their nature, computationally
intensive. This can be frustrating for users possessing large
datasets. For exploratory purposes, you may wish to override the
default number of bootstrap replications, say, setting them to
boot.num=99. A version of this package using the Rmpi
wrapper is under development that allows one to deploy this software
in a clustered computing environment to facilitate computation
involving large datasets.
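When lowering boot.num for exploratory work, it helps to know how noisy the resulting P-value is. The sketch below uses the usual binomial approximation for the simulation error of a bootstrap P-value, treating the replications as independent draws. The function mc_se is a hypothetical helper, not part of np.

```r
# Hypothetical helper: approximate Monte Carlo standard error of a
# bootstrap P-value near p when boot.num replications are used
mc_se <- function(p, boot.num) sqrt(p * (1 - p) / boot.num)

# Simulation error near the 5% level for three replication counts
se <- sapply(c(99, 399, 999), mc_se, p = 0.05)
round(se, 4)
```

Roughly quadrupling the number of replications halves the simulation error, so boot.num=99 is reasonable for a first look but coarse for borderline results.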
if (FALSE) { # \dontrun{
# EXAMPLE 1 (INTERFACE=FORMULA): For this example, we simulate 100 draws
# from a DGP in which z, the first column of X, is an irrelevant
# discrete variable
set.seed(12345)
n <- 100
z <- rbinom(n,1,.5)
x1 <- rnorm(n)
x2 <- runif(n,-2,2)
y <- x1 + x2 + rnorm(n)
# Next, we must compute bandwidths for our regression model. In this
# case we conduct local linear regression. Note - this may take a few
# minutes depending on the speed of your computer...
bw <- npregbw(formula=y~factor(z)+x1+x2,regtype="ll",bwmethod="cv.aic")
# We then compute a vector of tests corresponding to the columns of
# X. Note - this may take a few minutes depending on the speed of your
# computer... we have to generate the null distribution of the statistic
# for each variable whose significance is being tested using 399
# bootstrap replications for each...
npsigtest(bws=bw)
# If you wished, you could conduct the test for, say, variables 1 and 3
# only, as in
npsigtest(bws=bw,index=c(1,3))
# EXAMPLE 1 (INTERFACE=DATA FRAME): For this example, we simulate 100
# draws from a DGP in which z, the first column of X, is an irrelevant
# discrete variable
set.seed(12345)
n <- 100
z <- rbinom(n,1,.5)
x1 <- rnorm(n)
x2 <- runif(n,-2,2)
X <- data.frame(factor(z),x1,x2)
y <- x1 + x2 + rnorm(n)
# Next, we must compute bandwidths for our regression model. In this
# case we conduct local linear regression. Note - this may take a few
# minutes depending on the speed of your computer...
bw <- npregbw(xdat=X,ydat=y,regtype="ll",bwmethod="cv.aic")
# We then compute a vector of tests corresponding to the columns of
# X. Note - this may take a few minutes depending on the speed of your
# computer... we have to generate the null distribution of the statistic
# for each variable whose significance is being tested using 399
# bootstrap replications for each...
npsigtest(bws=bw)
# If you wished, you could conduct the test for, say, variables 1 and 3
# only, as in
npsigtest(bws=bw,index=c(1,3))
} # }