Linear Model Estimation Using Ordinary Least Squares
ols.RdFits the usual weighted or unweighted linear regression model using the
same fitting routines used by lm, but also storing the variance-covariance
matrix var and using traditional dummy-variable coding for categorical
factors.
Also fits unweighted models using penalized least squares, with the same
penalization options as in the lrm function. For penalized estimation,
there is a fitter function call lm.pfit.
Usage
ols(formula, data=environment(formula), weights, subset, na.action=na.delete,
method="qr", model=FALSE,
x=FALSE, y=FALSE, se.fit=FALSE, linear.predictors=TRUE,
penalty=0, penalty.matrix, tol=.Machine$double.eps, sigma,
var.penalty=c('simple','sandwich'), ...)Arguments
- formula
an S formula object, e.g.
Y ~ rcs(x1,5)*lsp(x2,c(10,20))- data
name of an S data frame containing all needed variables. Omit this to use a data frame already in the S “search list”.
- weights
an optional vector of weights to be used in the fitting process. If specified, weighted least squares is used with weights
weights(that is, minimizing \(sum(w*e^2)\)); otherwise ordinary least squares is used.- subset
an expression defining a subset of the observations to use in the fit. The default is to use all observations. Specify for example
age>50 & sex="male"orc(1:100,200:300)respectively to use the observations satisfying a logical expression or those having row numbers in the given vector.- na.action
specifies an S function to handle missing data. The default is the function
na.delete, which causes observations with any variable missing to be deleted. The main difference betweenna.deleteand the S-supplied functionna.omitis thatna.deletemakes a list of the number of observations that are missing on each variable in the model. Thena.actionis usally specified by e.g.options(na.action="na.delete").- method
specifies a particular fitting method, or
"model.frame"instead to return the model frame of the predictor and response variables satisfying any subset or missing value checks.- model
default is
FALSE. Set toTRUEto return the model frame as elementmodelof the fit object.- x
default is
FALSE. Set toTRUEto return the expanded design matrix as elementx(without intercept indicators) of the returned fit object. Set bothx=TRUEif you are going to use theresidualsfunction later to return anything other than ordinary residuals.- y
default is
FALSE. Set toTRUEto return the vector of response values as elementyof the fit.- se.fit
default is
FALSE. Set toTRUEto compute the estimated standard errors of the estimate of \(X\beta\) and store them in elementse.fitof the fit.- linear.predictors
set to
FALSEto cause predicted values not to be stored- penalty
see
lrm- penalty.matrix
see
lrm- tol
tolerance for information matrix singularity
- sigma
If
sigmais given, it is taken as the actual root mean squared error parameter for the model. Otherwisesigmais estimated from the data using the usual formulas (except for penalized models). It is often convenient to specifysigma=1for models with no error, when usingfastbwto find an approximate model that predicts predicted values from the full model with a given accuracy.- var.penalty
the type of variance-covariance matrix to be stored in the
varcomponent of the fit when penalization is used. The default is the inverse of the penalized information matrix. Specifyvar.penalty="sandwich"to use the sandwich estimator (see below undervar), which limited simulation studies have shown yields variances estimates that are too low.- ...
Value
the same objects returned from lm (unless penalty or penalty.matrix
are given - then an
abbreviated list is returned since lm.pfit is used as a fitter)
plus the design attributes
(see rms).
Predicted values are always returned, in the element linear.predictors.
The vectors or matrix stored if y=TRUE or x=TRUE have rows deleted according to subset and
to missing data, and have names or row names that come from the
data frame used as input data. If penalty or penalty.matrix is given,
the var matrix
returned is an improved variance-covariance matrix
for the penalized regression coefficient estimates. If
var.penalty="sandwich" (not the default, as limited simulation
studies have found it provides variance estimates that are too low) it
is defined as
\(\sigma^{2} (X'X + P)^{-1} X'X (X'X + P)^{-1}\), where \(P\) is
penalty factors * penalty.matrix, with a column and row of zeros
added for the
intercept. When var.penalty="simple" (the default), var is
\(\sigma^{2} (X'X + P)^{-1}\).
The returned list has a vector stats with named elements
n, Model L.R., d.f., R2, g, Sigma. Model L.R. is the model
likelihood ratio \(\chi^2\) statistic, and R2 is
\(R^2\). For penalized estimation, d.f. is the
effective degrees of freedom, which is the sum of the elements of another
vector returned, effective.df.diagonal, minus one for the
intercept.
g is the \(g\)-index.
Sigma is the penalized maximum likelihood estimate (see below).
Details
For penalized estimation, the penalty factor on the log likelihood is
\(-0.5 \beta' P \beta / \sigma^2\), where \(P\) is defined above.
The penalized maximum likelihood estimate (penalized least squares
or ridge estimate) of \(\beta\) is \((X'X + P)^{-1} X'Y\).
The maximum likelihood estimate of \(\sigma^2\) is \((sse + \beta'
P \beta) / n\), where
sse is the sum of squared errors (residuals).
The effective.df.diagonal vector is the
diagonal of the matrix \(X'X/(sse/n) \sigma^{2} (X'X + P)^{-1}\).
Examples
set.seed(1)
x1 <- runif(200)
x2 <- sample(0:3, 200, TRUE)
distance <- (x1 + x2/3 + rnorm(200))^2
d <- datadist(x1,x2)
options(datadist="d") # No d -> no summary, plot without giving all details
f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
#> Error in Design(X, formula = formula): dataset d not found for options(datadist=)
# could use d <- datadist(f); options(datadist="d") at this point,
# but predictor summaries would not be stored in the fit object for
# use with Predict, summary.rms. In that case, the original
# dataset or d would need to be accessed later, or all variable values
# would have to be specified to summary, plot
anova(f)
#> Error: object 'f' not found
which.influence(f)
#> Error: object 'f' not found
summary(f)
#> Error: object 'f' not found
summary.lm(f) # will only work if penalty and penalty.matrix not used
#> Error: object 'f' not found
# Fit a complex model and approximate it with a simple one
x1 <- runif(200)
x2 <- runif(200)
x3 <- runif(200)
x4 <- runif(200)
y <- x1 + x2 + rnorm(200)
f <- ols(y ~ rcs(x1,4) + x2 + x3 + x4)
#> Error in Design(X, formula = formula): dataset d not found for options(datadist=)
pred <- fitted(f) # or predict(f) or f$linear.predictors
#> Error: object 'f' not found
f2 <- ols(pred ~ rcs(x1,4) + x2 + x3 + x4, sigma=1)
#> Error in eval(predvars, data, env): object 'pred' not found
# sigma=1 prevents numerical problems resulting from R2=1
fastbw(f2, aics=100000)
#> Error: object 'f2' not found
# This will find the best 1-variable model, best 2-variable model, etc.
# in predicting the predicted values from the original model
options(datadist=NULL)