Survey-weighted generalised linear models.
Fit a generalised linear model to data from a complex survey design, with inverse-probability weighting and design-based standard errors.
Usage
# S3 method for class 'survey.design'
svyglm(formula, design, subset = NULL,
       family = stats::gaussian(), start = NULL, rescale = TRUE, ...,
       deff = FALSE, influence = FALSE,
       std.errors = c("linearized", "Bell-McCaffrey", "Bell-McCaffrey-2"),
       degf = FALSE)

# S3 method for class 'svyrep.design'
svyglm(formula, design, subset = NULL,
       family = stats::gaussian(), start = NULL, rescale = NULL, ..., rho = NULL,
       return.replicates = FALSE, na.action,
       multicore = getOption("survey.multicore"))

# S3 method for class 'svyglm'
summary(object, correlation = FALSE, df.resid = NULL, ...)

# S3 method for class 'svyglm'
predict(object, newdata = NULL, total = NULL,
        type = c("link", "response", "terms"),
        se.fit = (type != "terms"), vcov = FALSE, ...)

# S3 method for class 'svrepglm'
predict(object, newdata = NULL, total = NULL,
        type = c("link", "response", "terms"),
        se.fit = (type != "terms"), vcov = FALSE,
        return.replicates = !is.null(object$replicates), ...)
Arguments
- formula
Model formula
- design
Survey design from svydesign or svrepdesign. Must contain all variables in the formula
- subset
Expression to select a subpopulation
- family
family object for glm
- start
Starting values for the coefficients (needed for some uncommon link/family combinations)
- rescale
Rescaling of weights, to improve numerical stability. The default rescales weights to sum to the sample size. Use FALSE to not rescale weights. For replicate-weight designs, use TRUE to rescale weights to sum to 1, as was the case before version 3.34.
- ...
Other arguments passed to glm or summary.glm
- rho
For replicate BRR designs, to specify the parameter for Fay's variance method, giving weights of rho and 2-rho
- return.replicates
Return the replicates as the replicates component of the result? (for predict, only possible if they were computed in the svyglm fit)
- deff
Estimate the design effects
- influence
Return influence functions
- std.errors
The kind of standard errors to compute
- degf
Whether to compute the adjusted degrees of freedom along with Bell-McCaffrey standard errors
- object
A svyglm object
- correlation
Include the correlation matrix of parameters?
- na.action
Handling of NAs
- multicore
Use the multicore package to distribute replicates across processors?
- df.resid
Optional denominator degrees of freedom for Wald tests
- newdata
New data frame for prediction
- total
Population size when predicting a population total
- type
Linear predictor (link) or response
- se.fit
If TRUE, return variances of predictions
- vcov
If TRUE and se.fit=TRUE, return the full variance-covariance matrix of predictions
Details
For binomial and Poisson families use family=quasibinomial()
and family=quasipoisson() to avoid a warning about non-integer
numbers of successes. The `quasi' versions of the family objects give
the same point estimates and standard errors and do not give the
warning.
If df.resid is not specified, the df for the null model is
computed by degf and the residual df is computed by
subtraction. This is the procedure recommended by Korn and Graubard (1999); it is correct
for PSU-level covariates but is potentially very conservative for
individual-level covariates. To get tests based on a Normal distribution
use df.resid=Inf, and to use (number of PSUs) - (number of strata),
specify df.resid=degf(design).
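A minimal sketch of these options, reusing the api data that ships with the survey package (as in the Examples below):

```r
## Sketch: the three choices of denominator df for Wald tests
library(survey)
data(api)
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
fit <- svyglm(api00 ~ ell + meals, design = dclus2)

summary(fit)                           # default: conservative df by subtraction
summary(fit, df.resid = Inf)           # tests based on the Normal distribution
summary(fit, df.resid = degf(dclus2))  # (number of PSUs) - (number of strata)
```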
When std.errors="Bell-McCaffrey" or std.errors="Bell-McCaffrey-2"
is specified for a clustered svydesign, Bell-McCaffrey standard errors are produced,
which adjust for some of the known downward biases of linearized
standard errors. Bell and McCaffrey (2002) also suggest corrections
for the degrees of freedom, which end up being even more conservative
than the (# of PSUs)-(# of strata) design degrees of freedom.
By default, the computation of these degrees-of-freedom adjustments
is skipped (degf=FALSE) as they require dealing with
projection matrices of size nrow(svydesign$variables) by
nrow(svydesign$variables). The option
std.errors="Bell-McCaffrey" produces a version with a unit working
residual covariance matrix, and the option std.errors="Bell-McCaffrey-2"
produces a version with the exchangeable-correlation working
residual covariance matrix recommended by Imbens and Kolesár (2016),
which is typically more conservative. The standard errors themselves
are identical between these two options.
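A short sketch of the two variants with the api data, assuming a version of the survey package recent enough to support these options; as stated above, the standard errors agree while the degrees-of-freedom adjustments differ:

```r
## Sketch: the two Bell-McCaffrey variants give identical standard errors
library(survey)
data(api)
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)

fit.bm  <- svyglm(api00 ~ ell + meals, design = dclus2,
                  std.errors = "Bell-McCaffrey", degf = TRUE)
fit.bm2 <- svyglm(api00 ~ ell + meals, design = dclus2,
                  std.errors = "Bell-McCaffrey-2", degf = TRUE)
all.equal(SE(fit.bm), SE(fit.bm2))   # TRUE: same standard errors
```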
Parallel processing with multicore=TRUE is helpful only for
fairly large data sets and on computers with sufficient memory. It may
be incompatible with GUIs, although the Mac Aqua GUI appears to be safe.
predict gives fitted values and sampling variability for specific new
values of covariates. When newdata are the population mean it
gives the regression estimator of the mean, and when newdata are
the population totals and total is specified it gives the
regression estimator of the population total. Regression estimators of
mean and total can also be obtained with calibrate.
When the model is not of full rank, so that some coefficients are
NA, point predictions will be made by setting those coefficients
to zero. Standard error and variance estimates will be NA.
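As a sketch of prediction with a full covariance matrix, again using the api data; the newschools prediction frame is a hypothetical example:

```r
## Sketch: full variance-covariance matrix of several predictions
library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
fit <- svyglm(api00 ~ ell + meals, design = dstrat)

## hypothetical covariate values for two new schools
newschools <- data.frame(ell = c(10, 30), meals = c(20, 60))
pr <- predict(fit, newdata = newschools, vcov = TRUE)
vcov(pr)   # 2 x 2 variance-covariance matrix of the two predictions
```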
Note
svyglm always returns 'model-robust' standard errors; the
Horvitz-Thompson-type standard errors used everywhere in the survey
package are a generalisation of the model-robust 'sandwich' estimators.
In particular, a quasi-Poisson svyglm will return correct
standard errors for relative risk regression models.
Note
This function does not return the same standard error estimates for the regression estimator of population mean and total as some textbooks, or SAS. However, it does give the same standard error estimator as estimating the mean or total with calibrated weights.
In particular, under simple random sampling with or without replacement
there is a simple rescaling of the mean squared residual to estimate the
mean squared error of the regression estimator. The standard error
estimate produced by predict.svyglm has very similar
(asymptotically identical) expected
value to the textbook estimate, and has the advantage of being
applicable when the supplied newdata are not the population mean
of the predictors. The difference is small when the sample size is large, but can be
appreciable for small samples.
You can obtain the other standard error estimator by calling
predict.svyglm with the covariates set to their estimated (rather
than true) population mean values.
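A sketch of that comparison with the api data, where the true population mean of the predictor is known from apipop; the svymean call supplies the estimated population mean:

```r
## Sketch: regression estimator of the mean of api00 under two choices
## of where to evaluate the covariates
library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
fit <- svyglm(api00 ~ enroll, design = dstrat)

## covariate at its *true* population mean (known here from apipop)
predict(fit, newdata = data.frame(enroll = mean(apipop$enroll, na.rm = TRUE)))
## covariate at its *estimated* population mean: the other SE estimator
predict(fit, newdata = data.frame(enroll = coef(svymean(~enroll, dstrat))))
```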
Value
svyglm returns an object of class svyglm. The
predict method returns an object of class svystat if
se.fit is TRUE, and otherwise just a numeric vector.
See also
glm, which is used to do most of the work.
regTermTest, for multiparameter tests
calibrate, for an alternative way to specify regression
estimators of population totals or means
svyttest for one-sample and two-sample t-tests.
References
Robert M. Bell and Daniel F. McCaffrey (2002). Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples. Survey Methodology 28 (2), 169-181. https://www150.statcan.gc.ca/n1/pub/12-001-x/2002002/article/9058-eng.pdf
David A. Binder (1983). On the Variances of Asymptotically Normal Estimators from Complex Surveys. International Statistical Review: 51(3), 279-292.
Guido W. Imbens and Michal Kolesár (2016). Robust Standard Errors in Small Samples: Some Practical Advice. The Review of Economics and Statistics, 98(4): 701-712
Edward L. Korn and Barry I. Graubard (1999). Analysis of Health Surveys. Wiley Series in Survey Methodology. Wiley: Hoboken, NJ.
Thomas Lumley and Alastair J. Scott (2017). Fitting Regression Models to Survey Data. Statistical Science 32: 265-278.
Examples
data(api)
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
rstrat<-as.svrepdesign(dstrat)
rclus2<-as.svrepdesign(dclus2)
summary(svyglm(api00~ell+meals+mobility, design=dstrat))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = dstrat)
#>
#> Survey design:
#> svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat,
#> fpc = ~fpc)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 820.8873 10.0777 81.456 <2e-16 ***
#> ell -0.4806 0.3920 -1.226 0.222
#> meals -3.1415 0.2839 -11.064 <2e-16 ***
#> mobility 0.2257 0.3932 0.574 0.567
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 5171.966)
#>
#> Number of Fisher Scoring iterations: 2
#>
summary(svyglm(api00~ell+meals+mobility, design=dclus2))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = dclus2)
#>
#> Survey design:
#> svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 811.4907 30.8795 26.279 <2e-16 ***
#> ell -2.0592 1.4075 -1.463 0.152
#> meals -1.7772 1.1053 -1.608 0.117
#> mobility 0.3253 0.5305 0.613 0.544
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 8363.101)
#>
#> Number of Fisher Scoring iterations: 2
#>
summary(svyglm(api00~ell+meals+mobility, design=rstrat))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = rstrat)
#>
#> Survey design:
#> as.svrepdesign.default(dstrat)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 820.8873 10.5190 78.038 <2e-16 ***
#> ell -0.4806 0.4060 -1.184 0.238
#> meals -3.1415 0.2939 -10.691 <2e-16 ***
#> mobility 0.2257 0.4515 0.500 0.618
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 5146.106)
#>
#> Number of Fisher Scoring iterations: 2
#>
summary(svyglm(api00~ell+meals+mobility, design=rclus2))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = rclus2)
#>
#> Survey design:
#> as.svrepdesign.default(dclus2)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 811.4907 37.1474 21.845 <2e-16 ***
#> ell -2.0592 1.6177 -1.273 0.211
#> meals -1.7772 1.4901 -1.193 0.241
#> mobility 0.3253 0.6307 0.516 0.609
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 8296.727)
#>
#> Number of Fisher Scoring iterations: 2
#>
## standard errors corrected up
summary(svyglm(api00~ell+meals+mobility, design=dstrat, std.errors="Bell-McCaffrey"))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = dstrat,
#> std.errors = "Bell-McCaffrey")
#>
#> Survey design:
#> svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat,
#> fpc = ~fpc)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 820.8873 10.3016 79.685 <2e-16 ***
#> ell -0.4806 0.3994 -1.203 0.230
#> meals -3.1415 0.2894 -10.856 <2e-16 ***
#> mobility 0.2257 0.4227 0.534 0.594
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 5171.966)
#>
#> Number of Fisher Scoring iterations: 2
#>
summary(svyglm(api00~ell+meals+mobility, design=dclus2, std.errors="Bell-McCaffrey"))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = dclus2,
#> std.errors = "Bell-McCaffrey")
#>
#> Survey design:
#> svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 811.4907 32.9179 24.652 <2e-16 ***
#> ell -2.0592 1.4064 -1.464 0.152
#> meals -1.7772 1.1161 -1.592 0.120
#> mobility 0.3253 0.9481 0.343 0.734
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 8363.101)
#>
#> Number of Fisher Scoring iterations: 2
#>
## not applicable to replicate designs
## use quasibinomial, quasipoisson to avoid warning messages
summary(svyglm(sch.wide~ell+meals+mobility, design=dstrat,
family=quasibinomial()))
#>
#> Call:
#> svyglm(formula = sch.wide ~ ell + meals + mobility, design = dstrat,
#> family = quasibinomial())
#>
#> Survey design:
#> svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat,
#> fpc = ~fpc)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.835837 0.455626 1.834 0.0681 .
#> ell -0.002490 0.013252 -0.188 0.8512
#> meals -0.003152 0.009199 -0.343 0.7322
#> mobility 0.060897 0.031935 1.907 0.0580 .
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for quasibinomial family taken to be 1.108677)
#>
#> Number of Fisher Scoring iterations: 5
#>
## Compare regression and ratio estimation of totals
api.ratio <- svyratio(~api.stu,~enroll, design=dstrat)
pop<-data.frame(enroll=sum(apipop$enroll, na.rm=TRUE))
npop <- nrow(apipop)
predict(api.ratio, pop$enroll)
#> $total
#> enroll
#> api.stu 3190038
#>
#> $se
#> enroll
#> api.stu 29565.98
#>
## regression estimator is less efficient
api.reg <- svyglm(api.stu~enroll, design=dstrat)
predict(api.reg, newdata=pop, total=npop)
#> link SE
#> 1 3187252 31072
## same as calibration estimator
svytotal(~api.stu, calibrate(dstrat, ~enroll, pop=c(npop, pop$enroll)))
#> total SE
#> api.stu 3187252 31072
## svyglm can also reproduce the ratio estimator
api.reg2 <- svyglm(api.stu~enroll-1, design=dstrat,
family=quasi(link="identity",var="mu"))
predict(api.reg2, newdata=pop, total=npop)
#> link SE
#> 1 3190038 29566
## higher efficiency by modelling variance better
api.reg3 <- svyglm(api.stu~enroll-1, design=dstrat,
family=quasi(link="identity",var="mu^3"))
predict(api.reg3, newdata=pop, total=npop)
#> link SE
#> 1 3247986 21129
## true value
sum(apipop$api.stu)
#> [1] 3196602