Crossvalidation using replicate weights
withCrossval.RdIn each set of replicate weights there will be some clusters that have essentially zero weight. These are used as the test set, with the other clusters used as the training set. Jackknife weights ("JK1","JKn") are very similar to cross-validation at the cluster level; bootstrap weights are similar to bootstrapping for cross-validation.
Usage
withCrossval(design, formula, trainfun, testfun, loss = c("MSE",
"entropy", "AbsError"), intercept, tuning, nearly_zero=1e-4,...)Arguments
- design
A survey design object (currently only
svyrep.design)- formula
Model formula where the left-hand side specifies the outcome vairable and the right-hand side specifies the variables that will be used for prediction
- trainfun
Function taking a predictor matrix
X, an outcome vectory, a weights vectorw, and an element oftuning, and training a model that is returned as some R object.- testfun
Function taking a predictor matrix
Xand the output fromtrainfunand returning fitted values for the outcome variable.- loss
Loss function for assessing prediction
- intercept
Should the predictor matrix have an intercept added?
- tuning
vector of tuning parameters, such as the regularisation parameter in information criteria or the number of predictors.
trainfunandtestfunwill be called with each element of this vector in turn. Use any single-element vector if no tuning parameter is needed- nearly_zero
test-set threshold on the scale of replicate weight divided by sampling weight.
- ...
future expansion
References
Iparragirre, A., Lumley, T., Barrio, I., & Arostegui, I. (2023). Variable selection with LASSO regression for complex survey data. Stat, 12(1), e578.
Examples
data(api)
rclus1<-as.svrepdesign(svydesign(id=~dnum, weights=~pw, data=apiclus1,
fpc=~fpc))
withCrossval(rclus1, api00~api99+ell+stype,
trainfun=function(X,y,w,tuning) lm.wfit(X,y,w),
testfun=function(X, trainfit,tuning) X%*%coef(trainfit),
intercept=TRUE,loss="MSE",tuning=1)
#> [1] 723.04
## More realistic example using lasso
## tuning parameter is number of variables in model
##
## library(glmnet)
## ftrain=function(X,y,w,tuning) {
## m<-glmnet(X,y,weights=w)
## lambda<-m$lambda[min(which(m$df>=tuning))]
## list(m,lambda)
## }
## ftest=function(X, trainfit, tuning){
## predict(trainfit[[1]], newx=X, s=trainfit[[2]])
## }
##
## withCrossval(rclus1, api00~api99+ell+stype+mobility+enroll,
## trainfun=ftrain,
## testfun=ftest,
## intercept=FALSE,loss="MSE",
## tuning=0:3)
##
## [1] 11445.2379 9649.1150 800.0742 787.4171
##
## Models with two or three predictors are about equally good