Survey-weighted generalised linear models.
Fit a generalised linear model to data from a complex survey design, with inverse-probability weighting and design-based standard errors.
Usage
# S3 method for class 'survey.design'
svyglm(formula, design, subset = NULL,
       family = stats::gaussian(), start = NULL, rescale = TRUE, ...,
       deff = FALSE, influence = FALSE,
       std.errors = c("linearized", "Bell-McCaffrey", "Bell-McCaffrey-2"),
       degf = FALSE)

# S3 method for class 'svyrep.design'
svyglm(formula, design, subset = NULL,
       family = stats::gaussian(), start = NULL, rescale = NULL, ..., rho = NULL,
       return.replicates = FALSE, na.action,
       multicore = getOption("survey.multicore"))

# S3 method for class 'svyglm'
summary(object, correlation = FALSE, df.resid = NULL, ...)

# S3 method for class 'svyglm'
predict(object, newdata = NULL, total = NULL,
        type = c("link", "response", "terms"),
        se.fit = (type != "terms"), vcov = FALSE, ...)

# S3 method for class 'svrepglm'
predict(object, newdata = NULL, total = NULL,
        type = c("link", "response", "terms"),
        se.fit = (type != "terms"), vcov = FALSE,
        return.replicates = !is.null(object$replicates), ...)
Arguments
- formula
Model formula
- design
Survey design from svydesign or svrepdesign. Must contain all variables in the formula
- subset
Expression to select a subpopulation
- family
family object for glm
- start
Starting values for the coefficients (needed for some uncommon link/family combinations)
- rescale
Rescaling of weights, to improve numerical stability. The default rescales weights to sum to the sample size. Use FALSE to not rescale weights. For replicate-weight designs, use TRUE to rescale weights to sum to 1, as was the case before version 3.34.
- ...
Other arguments passed to glm or summary.glm
- rho
For replicate BRR designs, to specify the parameter for Fay's variance method, giving weights of rho and 2-rho
- return.replicates
Return the replicates as the replicates component of the result? (for predict, only possible if they were computed in the svyglm fit)
- deff
Estimate the design effects
- influence
Return influence functions
- std.errors
The kind of standard errors to compute
- degf
Whether to compute the adjusted degrees of freedom along with Bell-McCaffrey standard errors
- object
A svyglm object
- correlation
Include the correlation matrix of parameters?
- na.action
Handling of NAs
- multicore
Use the multicore package to distribute replicates across processors?
- df.resid
Optional denominator degrees of freedom for Wald tests
- newdata
New data frame for prediction
- total
Population size when predicting a population total
- type
Linear predictor (link) or response
- se.fit
If TRUE, return variances of predictions
- vcov
If TRUE and se.fit=TRUE, return the full variance-covariance matrix of predictions
Details
For binomial and Poisson families use family=quasibinomial()
and family=quasipoisson() to avoid a warning about non-integer
numbers of successes. The `quasi' versions of the family objects give
the same point estimates and standard errors and do not give the
warning.
If df.resid is not specified, the df for the null model is
computed by degf and the residual df is computed by
subtraction. This is the procedure recommended by Korn and Graubard (1999); it is correct
for PSU-level covariates but is potentially very conservative for
individual-level covariates. To get tests based on a Normal distribution
use df.resid=Inf, and to use (number of PSUs) - (number of strata),
specify df.resid=degf(design).
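A minimal sketch of these options, reusing the api data that ships with the survey package (as in the Examples below):

```r
## Sketch: the three choices of denominator df for Wald tests
library(survey)
data(api)
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
fit <- svyglm(api00 ~ ell + meals, design = dclus2)

summary(fit)                           # default: conservative df by subtraction
summary(fit, df.resid = Inf)           # tests based on the Normal distribution
summary(fit, df.resid = degf(dclus2))  # (number of PSUs) - (number of strata)
```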
When std.errors="Bell-McCaffrey" or std.errors="Bell-McCaffrey-2"
is specified for a clustered svydesign, Bell-McCaffrey standard errors are produced,
which adjust for some of the known downward biases of linearized
standard errors. Bell and McCaffrey (2002) also suggest corrections
for the degrees of freedom, which end up being even more conservative
than the (# of PSUs)-(# of strata) design degrees of freedom.
By default, the computation of these degrees-of-freedom adjustments
is skipped (degf=FALSE) as they require dealing with
projection matrices of size nrow(svydesign$variables) by
nrow(svydesign$variables). The option
std.errors="Bell-McCaffrey" produces a version with a unit working
residual covariance matrix, and the option std.errors="Bell-McCaffrey-2"
produces a version with the exchangeable-correlation working
residual covariance matrix recommended by Imbens and Kolesár (2016),
which is typically more conservative. The standard errors themselves
are identical between these two options.
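A short sketch of the two variants with the api data, assuming a version of the survey package recent enough to support these options; as stated above, the standard errors agree while the degrees-of-freedom adjustments differ:

```r
## Sketch: the two Bell-McCaffrey variants give identical standard errors
library(survey)
data(api)
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)

fit.bm  <- svyglm(api00 ~ ell + meals, design = dclus2,
                  std.errors = "Bell-McCaffrey", degf = TRUE)
fit.bm2 <- svyglm(api00 ~ ell + meals, design = dclus2,
                  std.errors = "Bell-McCaffrey-2", degf = TRUE)
all.equal(SE(fit.bm), SE(fit.bm2))   # TRUE: same standard errors
```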
Parallel processing with multicore=TRUE is helpful only for
fairly large data sets and on computers with sufficient memory. It may
be incompatible with GUIs, although the Mac Aqua GUI appears to be safe.
predict gives fitted values and sampling variability for specific new
values of covariates. When newdata are the population mean it
gives the regression estimator of the mean, and when newdata are
the population totals and total is specified it gives the
regression estimator of the population total. Regression estimators of
mean and total can also be obtained with calibrate.
When the model is not of full rank, so that some coefficients are
NA, point predictions will be made by setting those coefficients
to zero. Standard error and variance estimates will be NA.
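As a sketch of prediction with a full covariance matrix, again using the api data; the newschools prediction frame is a hypothetical example:

```r
## Sketch: full variance-covariance matrix of several predictions
library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
fit <- svyglm(api00 ~ ell + meals, design = dstrat)

## hypothetical covariate values for two new schools
newschools <- data.frame(ell = c(10, 30), meals = c(20, 60))
pr <- predict(fit, newdata = newschools, vcov = TRUE)
vcov(pr)   # 2 x 2 variance-covariance matrix of the two predictions
```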
Note
svyglm always returns 'model-robust' standard errors; the
Horvitz-Thompson-type standard errors used everywhere in the survey
package are a generalisation of the model-robust 'sandwich' estimators.
In particular, a quasi-Poisson svyglm will return correct
standard errors for relative risk regression models.
Note
This function does not return the same standard error estimates for the regression estimator of population mean and total as some textbooks, or SAS. However, it does give the same standard error estimator as estimating the mean or total with calibrated weights.
In particular, under simple random sampling with or without replacement
there is a simple rescaling of the mean squared residual to estimate the
mean squared error of the regression estimator. The standard error
estimate produced by predict.svyglm has very similar
(asymptotically identical) expected
value to the textbook estimate, and has the advantage of being
applicable when the supplied newdata are not the population mean
of the predictors. The difference is small when the sample size is large, but can be
appreciable for small samples.
You can obtain the other standard error estimator by calling
predict.svyglm with the covariates set to their estimated (rather
than true) population mean values.
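A sketch of that comparison with the api data, where the true population mean of the predictor is known from apipop; the svymean call supplies the estimated population mean:

```r
## Sketch: regression estimator of the mean of api00 under two choices
## of where to evaluate the covariates
library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
fit <- svyglm(api00 ~ enroll, design = dstrat)

## covariate at its *true* population mean (known here from apipop)
predict(fit, newdata = data.frame(enroll = mean(apipop$enroll, na.rm = TRUE)))
## covariate at its *estimated* population mean: the other SE estimator
predict(fit, newdata = data.frame(enroll = coef(svymean(~enroll, dstrat))))
```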
Value
svyglm returns an object of class svyglm. The
predict method returns an object of class svystat if
se.fit is TRUE, and otherwise just a numeric vector.
See also
glm, which is used to do most of the work.
regTermTest, for multiparameter tests
calibrate, for an alternative way to specify regression
estimators of population totals or means
svyttest for one-sample and two-sample t-tests.
References
Robert M. Bell and Daniel F. McCaffrey (2002). Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples. Survey Methodology 28 (2), 169-181. https://www150.statcan.gc.ca/n1/pub/12-001-x/2002002/article/9058-eng.pdf
David A. Binder (1983). On the Variances of Asymptotically Normal Estimators from Complex Surveys. International Statistical Review: 51(3), 279-292.
Guido W. Imbens and Michal Kolesár (2016). Robust Standard Errors in Small Samples: Some Practical Advice. The Review of Economics and Statistics, 98(4): 701-712
Edward L. Korn and Barry I. Graubard (1999). Analysis of Health Surveys. Wiley Series in Survey Methodology. Wiley: Hoboken, NJ.
Thomas Lumley and Alastair J. Scott (2017). Fitting Regression Models to Survey Data. Statistical Science 32: 265-278.
Examples
data(api)
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
rstrat<-as.svrepdesign(dstrat)
rclus2<-as.svrepdesign(dclus2)
summary(svyglm(api00~ell+meals+mobility, design=dstrat))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = dstrat)
#>
#> Survey design:
#> svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat,
#> fpc = ~fpc)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 820.8873 10.0777 81.456 <2e-16 ***
#> ell -0.4806 0.3920 -1.226 0.222
#> meals -3.1415 0.2839 -11.064 <2e-16 ***
#> mobility 0.2257 0.3932 0.574 0.567
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 5171.966)
#>
#> Number of Fisher Scoring iterations: 2
#>
summary(svyglm(api00~ell+meals+mobility, design=dclus2))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = dclus2)
#>
#> Survey design:
#> svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 811.4907 30.8795 26.279 <2e-16 ***
#> ell -2.0592 1.4075 -1.463 0.152
#> meals -1.7772 1.1053 -1.608 0.117
#> mobility 0.3253 0.5305 0.613 0.544
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 8363.101)
#>
#> Number of Fisher Scoring iterations: 2
#>
summary(svyglm(api00~ell+meals+mobility, design=rstrat))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = rstrat)
#>
#> Survey design:
#> as.svrepdesign.default(dstrat)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 820.8873 10.5190 78.038 <2e-16 ***
#> ell -0.4806 0.4060 -1.184 0.238
#> meals -3.1415 0.2939 -10.691 <2e-16 ***
#> mobility 0.2257 0.4515 0.500 0.618
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 5146.106)
#>
#> Number of Fisher Scoring iterations: 2
#>
summary(svyglm(api00~ell+meals+mobility, design=rclus2))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = rclus2)
#>
#> Survey design:
#> as.svrepdesign.default(dclus2)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 811.4907 37.1474 21.845 <2e-16 ***
#> ell -2.0592 1.6177 -1.273 0.211
#> meals -1.7772 1.4901 -1.193 0.241
#> mobility 0.3253 0.6307 0.516 0.609
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 8296.727)
#>
#> Number of Fisher Scoring iterations: 2
#>
## standard errors corrected up
summary(svyglm(api00~ell+meals+mobility, design=dstrat, std.errors="Bell-McCaffrey"))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = dstrat,
#> std.errors = "Bell-McCaffrey")
#>
#> Survey design:
#> svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat,
#> fpc = ~fpc)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 820.8873 10.3016 79.685 <2e-16 ***
#> ell -0.4806 0.3994 -1.203 0.230
#> meals -3.1415 0.2894 -10.856 <2e-16 ***
#> mobility 0.2257 0.4227 0.534 0.594
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 5171.966)
#>
#> Number of Fisher Scoring iterations: 2
#>
summary(svyglm(api00~ell+meals+mobility, design=dclus2, std.errors="Bell-McCaffrey"))
#>
#> Call:
#> svyglm(formula = api00 ~ ell + meals + mobility, design = dclus2,
#> std.errors = "Bell-McCaffrey")
#>
#> Survey design:
#> svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 811.4907 32.9179 24.652 <2e-16 ***
#> ell -2.0592 1.4064 -1.464 0.152
#> meals -1.7772 1.1161 -1.592 0.120
#> mobility 0.3253 0.9481 0.343 0.734
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for gaussian family taken to be 8363.101)
#>
#> Number of Fisher Scoring iterations: 2
#>
## not applicable to replicate designs
## use quasibinomial, quasipoisson to avoid warning messages
summary(svyglm(sch.wide~ell+meals+mobility, design=dstrat,
family=quasibinomial()))
#>
#> Call:
#> svyglm(formula = sch.wide ~ ell + meals + mobility, design = dstrat,
#> family = quasibinomial())
#>
#> Survey design:
#> svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat,
#> fpc = ~fpc)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.835837 0.455626 1.834 0.0681 .
#> ell -0.002490 0.013252 -0.188 0.8512
#> meals -0.003152 0.009199 -0.343 0.7322
#> mobility 0.060897 0.031935 1.907 0.0580 .
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> (Dispersion parameter for quasibinomial family taken to be 1.108677)
#>
#> Number of Fisher Scoring iterations: 5
#>
## Compare regression and ratio estimation of totals
api.ratio <- svyratio(~api.stu,~enroll, design=dstrat)
pop<-data.frame(enroll=sum(apipop$enroll, na.rm=TRUE))
npop <- nrow(apipop)
predict(api.ratio, pop$enroll)
#> $total
#> enroll
#> api.stu 3190038
#>
#> $se
#> enroll
#> api.stu 29565.98
#>
## regression estimator is less efficient
api.reg <- svyglm(api.stu~enroll, design=dstrat)
predict(api.reg, newdata=pop, total=npop)
#> link SE
#> 1 3187252 31072
## same as calibration estimator
svytotal(~api.stu, calibrate(dstrat, ~enroll, pop=c(npop, pop$enroll)))
#> total SE
#> api.stu 3187252 31072
## svyglm can also reproduce the ratio estimator
api.reg2 <- svyglm(api.stu~enroll-1, design=dstrat,
family=quasi(link="identity",var="mu"))
predict(api.reg2, newdata=pop, total=npop)
#> link SE
#> 1 3190038 29566
## higher efficiency by modelling variance better
api.reg3 <- svyglm(api.stu~enroll-1, design=dstrat,
family=quasi(link="identity",var="mu^3"))
predict(api.reg3, newdata=pop, total=npop)
#> link SE
#> 1 3247986 21129
## true value
sum(apipop$api.stu)
#> [1] 3196602