Logistic Regression Model
Fit binary and proportional odds ordinal
logistic regression models using maximum likelihood estimation or
penalized maximum likelihood estimation. See cr.setup for how to
fit forward continuation ratio models with lrm. The fitting function used by lrm is lrm.fit,
whose documentation gives details and comparisons of its various optimization methods.
For the print method, the format of output is controlled by the
user previously running options(prType="lang"), where
lang is "plain" (the default), "latex", or
"html". When using html with Quarto or RMarkdown,
results='asis' need not be written in the chunk header.
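A minimal sketch of this (y and x are hypothetical variables, f a fit from lrm):

options(prType='html')    # subsequent print output is rendered as HTML
f <- lrm(y ~ x)
f                         # or print(f); no results='asis' needed in the chunk header
options(prType='plain')   # restore the default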
Usage
lrm(formula, data=environment(formula),
    subset, na.action=na.delete, method="lrm.fit",
    model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE,
    penalty=0, penalty.matrix,
    var.penalty,
    weights, normwt=FALSE, scale, ...)
# S3 method for class 'lrm'
print(x, digits=4, r2=c(0,2,4),
      coefs=TRUE, pg=FALSE,
      intercepts=x$non.slopes < 10,
      title='Logistic Regression Model', ...)
Note
When creating ordinal levels for non-integer numeric Y, this function does not use the unique function because it was found to be non-reproducible across hardware platforms. Instead, intercepts are mapped to integers obtained from rounding the original Y levels after multiplying by 1e7 (by default). The only way to guarantee the correct mapping to intercepts is to reference the "official" levels of Y stored in the yunique object found in the fit object. Users may opt instead to replace the original data with e.g. 1e-7 * round(Y * 1e7) before model fitting, as sketched below.
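A minimal sketch of that workaround (y is a hypothetical non-integer numeric response, x a predictor):

y <- 1e-7 * round(y * 1e7)   # canonicalize the levels before fitting
f <- lrm(y ~ x)
f$yunique                    # the official levels of Y to which the intercepts refer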
Arguments
- formula
a formula object. An offset term can be included. The offset causes fitting of a model such as \(\text{logit}(Y=1) = X\beta + W\), where \(W\) is the offset variable having no estimated coefficient. The response variable can be any data type; lrm converts it in alphabetic or numeric order to an S factor variable and recodes it 0,1,2,... internally.
- data
data frame to use. Default is the current frame.
- subset
logical expression or vector of subscripts defining a subset of observations to analyze
- na.action
function to handle NAs in the data. Default is na.delete, which deletes any observation having response or predictor missing, while preserving the attributes of the predictors and maintaining frequencies of deletions due to each variable in the model. This is usually specified using options(na.action="na.delete").
- method
name of fitting function. Only allowable choice at present is lrm.fit.
- model
causes the model frame to be returned in the fit object
- x
causes the expanded design matrix (with missings excluded) to be returned under the name x. For print, an object created by lrm.
- y
causes the response variable (with missings excluded) to be returned under the name y.
- linear.predictors
causes the predicted X beta (with missings excluded) to be returned under the name linear.predictors. When the response variable has more than two levels, the first intercept is used.
- se.fit
causes the standard errors of the fitted values to be returned under the name se.fit.
- penalty
The penalty factor subtracted from the log likelihood is \(0.5 \beta' P \beta\), where \(\beta\) is the vector of regression coefficients other than intercept(s), and \(P\) is penalty factors * penalty.matrix, where penalty.matrix is defined below. The default is penalty=0, implying that ordinary unpenalized maximum likelihood estimation is used. If penalty is a scalar, it is assumed to be a penalty factor that applies to all non-intercept parameters in the model. Alternatively, specify a list to penalize different types of model terms by differing amounts. The elements in this list are named simple, nonlinear, interaction and nonlinear.interaction. If you omit elements on the right of this series, values are inherited from elements on the left. Examples: penalty=list(simple=5, nonlinear=10) uses a penalty factor of 10 for nonlinear or interaction terms. penalty=list(simple=0, nonlinear=2, nonlinear.interaction=4) does not penalize linear main effects, uses a penalty factor of 2 for nonlinear or interaction effects (that are not both), and 4 for nonlinear interaction effects. See the sketch following this list for a worked example.
- penalty.matrix
specifies the symmetric penalty matrix for non-intercept terms. The default matrix for continuous predictors has the variance of the columns of the design matrix in its diagonal elements so that the penalty to the log likelihood is unitless. For main effects for categorical predictors with \(c\) categories, the rows and columns of the matrix contain a \((c-1) \times (c-1)\) sub-matrix that is used to compute the sum of squares about the mean of the \(c\) parameter values (setting the parameter to zero for the reference cell) as the penalty component for that predictor. This makes the penalty independent of the choice of the reference cell. If you specify penalty.matrix, you may set the rows and columns for certain parameters to zero so as to not penalize those parameters. Depending on penalty, some elements of penalty.matrix may be overridden automatically by setting them to zero. The penalty matrix that is used in the actual fit is \(penalty \times \mathrm{diag}(pf) \times \mathrm{penalty.matrix} \times \mathrm{diag}(pf)\), where \(pf\) is the vector of square roots of penalty factors computed from penalty by Penalty.setup in rmsMisc. If you specify penalty.matrix you must specify a nonzero value of penalty, or no penalization will be done.
- var.penalty
deprecated and ignored
- weights
a vector (same length as y) of possibly fractional case weights
- normwt
set to TRUE to scale weights so they sum to the length of y; useful for sample surveys as opposed to the default of frequency weighting
- scale
deprecated; see the lrm.fit transx argument
- ...
arguments that are passed to lrm.fit, or from print, to prModFit
- digits
number of significant digits to use
- r2
vector of integers specifying which R^2 measures to print, with 0 for Nagelkerke R^2 and 1:4 corresponding to the 4 measures computed by R2Measures. Default is to print Nagelkerke (labeled R2) and the second and fourth R2Measures, which are the measures adjusted for the number of predictors, first for the raw sample size and then for the effective sample size, which here is from the formula for the approximate variance of a log odds ratio in a proportional odds model.
- coefs
specify coefs=FALSE to suppress printing the table of model coefficients, standard errors, etc. Specify coefs=n to print only the first n regression coefficients in the model.
- pg
set to TRUE to print g-indexes
- intercepts
controls printing of intercepts. By default they are only printed if there aren't more than 10 of them.
- title
a character string title to be passed to
prModFit
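A minimal sketch tying several of these arguments together, on simulated data (all variable names here are hypothetical):

require(rms)
set.seed(1)
n   <- 200
age <- rnorm(n, 50, 10)
sex <- factor(sample(c('female','male'), n, TRUE))
fu  <- runif(n, 1, 5)    # follow-up time, entered as an offset
y   <- ifelse(runif(n) < plogis(-1 + .04*(age-50) + log(fu)), 1, 0)
# offset(log(fu)) enters the linear predictor with coefficient fixed at 1;
# the list form of penalty leaves linear main effects unpenalized and
# penalizes nonlinear and interaction terms by a factor of 10
f <- lrm(y ~ rcs(age,4)*sex + offset(log(fu)),
         penalty=list(simple=0, nonlinear=10), x=TRUE, y=TRUE)
f$penalty.matrix    # the penalty matrix actually used in the estimation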
Value
The returned fit object of lrm contains the following components
in addition to the ones mentioned under the optional arguments.
- call
calling expression
- freq
table of frequencies for Y in order of increasing Y
- stats
vector with the following elements: number of observations used in the fit, maximum absolute value of first derivative of log likelihood, model likelihood ratio \(\chi^2\), d.f., \(P\)-value, \(c\) index (area under ROC curve), Somers' \(D_{xy}\), Goodman-Kruskal \(\gamma\), Kendall's \(\tau_a\) rank correlations between predicted probabilities and observed response, the Nagelkerke \(R^2\) index, the Brier score computed with respect to \(Y >\) its lowest level, the \(g\)-index, \(gr\) (the \(g\)-index on the odds ratio scale), and \(gp\) (the \(g\)-index on the probability scale using the same cutoff used for the Brier score). Probabilities are rounded to the nearest 0.0002 in the computations of rank correlation indexes. In the case of penalized estimation, the "Model L.R." is computed without the penalty factor, and "d.f." is the effective d.f. from Gray's (1992) Equation 2.9. The \(P\)-value uses this corrected model L.R. \(\chi^2\) and corrected d.f. The score chi-square statistic uses first derivatives which contain penalty components. See the sketch following this list for accessing these elements.
- fail
set to TRUE if convergence failed (and maxiter>1)
- coefficients
estimated parameters
- var
estimated variance-covariance matrix (inverse of information matrix). If penalty>0, var is the inverse of the penalized information matrix.
- effective.df.diagonal
is returned if penalty>0. It is the vector whose sum is the effective d.f. of the model (counting intercept terms).
- u
vector of first derivatives of log-likelihood
- deviance
-2 log likelihoods (counting penalty components). When an offset variable is present, three deviances are computed: for intercept(s) only, for intercepts+offset, and for intercepts+offset+predictors. When there is no offset variable, the vector contains deviances for the intercept(s)-only model and the model with intercept(s) and predictors.
- est
vector of column numbers of X fitted (intercepts are not counted)
- non.slopes
number of intercepts in model
- penalty
see above
- penalty.matrix
the penalty matrix actually used in the estimation
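A minimal sketch of accessing a few of these components, continuing the simulated fit f from the sketch under Arguments (the element names follow the stats vector described above):

f$stats[c('Obs', 'C', 'Dxy', 'Brier')]   # n, ROC area, Somers' Dxy, Brier score
f$coefficients                           # estimated parameters
f$non.slopes                             # number of intercepts (1 for binary Y)
f$freq                                   # frequencies of Y in increasing order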
References
Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Applied Statistics 41:191–201, 1992.
Verweij PJM, Van Houwelingen JC: Penalized likelihood in Cox regression. Stat in Med 13:2427–2436, 1994.
Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. JASA 87:942–951, 1992.
Shao J: Linear model selection by cross-validation. JASA 88:486–494, 1993.
Verweij PJM, Van Houwelingen JC: Crossvalidation in survival analysis. Stat in Med 12:2305–2314, 1993.
Harrell FE: Model uncertainty, penalization, and parsimony. ISCB Presentation on UVa Web page, 1998.
Examples
#Fit a logistic model containing predictors age, blood.pressure, sex
#and cholesterol, with age fitted with a smooth 5-knot restricted cubic
#spline function and a different shape of the age relationship for males
#and females. As an intermediate step, predict mean cholesterol from
#age using a proportional odds ordinal logistic model
#
require(rms)       # provides lrm, datadist, and the Hmisc functions label, units, cut2
require(ggplot2)
n <- 1000 # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
label(age) <- 'Age' # label is in Hmisc
label(cholesterol) <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex) <- 'Sex'
units(cholesterol) <- 'mg/dl' # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'
# Group cholesterol unnecessarily into 40-tiles
ch <- cut2(cholesterol, g=40, levels.mean=TRUE) # use mean values in intervals
table(ch)
#> ch
#> 143.16 155.16 162.68 166.81 170.68 173.56 176.95 179.04 181.17 182.83 184.91
#> 25 25 25 25 25 25 25 25 25 25 25
#> 186.74 188.48 190.09 191.54 193.23 195.04 196.62 198.16 199.93 201.07 202.78
#> 25 25 25 25 25 25 25 25 25 25 25
#> 204.92 206.49 208.08 209.53 211.16 212.71 214.12 215.80 217.42 219.50 221.76
#> 25 25 25 25 25 25 25 25 25 25 25
#> 224.28 227.32 230.02 233.76 238.77 245.69 255.30
#> 25 25 25 25 25 25 25
f <- lrm(ch ~ age)
# options(prType='latex')
print(f) # write latex code to console if prType='latex' is in effect
#> Logistic Regression Model
#>
#> lrm(formula = ch ~ age)
#>
#> Model Likelihood Discrimination Rank Discrim.
#> Ratio Test Indexes Indexes
#> Obs 1000 LR chi2 0.19 R2 0.000 C 0.511
#> max |deriv| 4e-06 d.f. 1 R2(1,1000) 0.000 Dxy 0.021
#> Pr(> chi2) 0.6651 R2(1,999.4) 0.000 gamma 0.021
#> Brier 0.249 tau-a 0.021
#>
#> Coef S.E. Wald Z Pr(>|Z|)
#> age -0.0023 0.0054 -0.43 0.6651
#>
m <- Mean(f) # see help file for Mean.lrm
d <- data.frame(age=seq(0,90,by=10))
m(predict(f, d))
#> 1 2 3 4 5 6 7 8 9 10
#> 202.06 201.73 201.41 201.09 200.76 200.44 200.12 199.79 199.47 199.14
# Repeat using ols
f <- ols(cholesterol ~ age)
predict(f, d)
#> 1 2 3 4 5 6 7 8 9 10
#> 201.46 201.26 201.05 200.85 200.64 200.44 200.23 200.03 199.82 199.61
# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
(log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
cholesterol[1:3] <- NA # 3 missings, at random
ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
x=TRUE, y=TRUE)
# x=TRUE, y=TRUE allows use of resid(), which.influence below
# could define d <- datadist(fit) after lrm(), but data distribution
# summary would not be stored with fit, so later uses of Predict
# or summary.rms would require access to the original dataset or
# d or specifying all variable values to summary, Predict, nomogram
anova(fit)
p <- Predict(fit, age, sex)
ggplot(p) # or plot()
ggplot(Predict(fit, age=20:70, sex="male")) # need if datadist not used
print(cbind(resid(fit,"dfbetas"), resid(fit,"dffits"))[1:20,])
which.influence(fit, .3)
# latex(fit) #print nice statement of fitted model
#
#Repeat this fit using penalized MLE, penalizing complex terms
#(for nonlinear or interaction effects)
#
fitp <- update(fit, penalty=list(simple=0,nonlinear=10), x=TRUE, y=TRUE)
effective.df(fitp)
# or lrm(y ~ ..., penalty=...)
#Get fits for a variety of penalties and assess predictive accuracy
#in a new data set. Program efficiently so that complex design
#matrices are only created once.
set.seed(201)
x1 <- rnorm(500)
x2 <- rnorm(500)
x3 <- sample(0:1,500,rep=TRUE)
L <- x1+abs(x2)+x3
y <- ifelse(runif(500)<=plogis(L), 1, 0)
new.data <- data.frame(x1,x2,x3,y)[301:500,]
#
for(penlty in seq(0,.15,by=.005)) {
if(penlty==0) {
f <- lrm(y ~ rcs(x1,4)+rcs(x2,6)*x3, subset=1:300, x=TRUE, y=TRUE)
# True model is linear in x1 and has no interaction
X <- f$x # saves time for future runs - don't have to use rcs etc.
Y <- f$y # this also deletes rows with NAs (if there were any)
penalty.matrix <- diag(diag(var(X)))
Xnew <- predict(f, new.data, type="x")
# expand design matrix for new data
Ynew <- new.data$y
} else f <- lrm.fit(X,Y, penalty.matrix=penlty*penalty.matrix)
#
cat("\nPenalty :",penlty,"\n")
pred.logit <- f$coef[1] + (Xnew %*% f$coef[-1])
pred <- plogis(pred.logit)
C.index <- somers2(pred, Ynew)["C"]
Brier <- mean((pred-Ynew)^2)
Deviance<- -2*sum( Ynew*log(pred) + (1-Ynew)*log(1-pred) )
cat("ROC area:",format(C.index)," Brier score:",format(Brier),
" -2 Log L:",format(Deviance),"\n")
}
#penalty=0.045 gave lowest -2 Log L, Brier, ROC in test sample for S+
#
#Use bootstrap validation to estimate predictive accuracy of
#logistic models with various penalties
#To see how noisy cross-validation estimates can be, change the
#validate(f, ...) to validate(f, method="cross", B=10) for example.
#You will see tremendous variation in accuracy with minute changes in
#the penalty. This comes from the error inherent in using 10-fold
#cross validation but also because we are not fixing the splits.
#20-fold cross validation was even worse for some
#indexes because of the small test sample size. Stability would be
#obtained by using the same sample splits for all penalty values
#(see above), but then we wouldn't be sure that the choice of the
#best penalty is not specific to how the sample was split. This
#problem is addressed in the last example.
#
penalties <- seq(0,.7,length=3) # really use by=.02
index <- matrix(NA, nrow=length(penalties), ncol=11,
dimnames=list(format(penalties),
c("Dxy","R2","Intercept","Slope","Emax","D","U","Q","B","g","gp")))
i <- 0
for(penlty in penalties)
{
cat(penlty, "")
i <- i+1
if(penlty==0)
{
f <- lrm(y ~ rcs(x1,4)+rcs(x2,6)*x3, x=TRUE, y=TRUE) # fit whole sample
X <- f$x
Y <- f$y
penalty.matrix <- diag(diag(var(X))) # save time - only do once
}
else
f <- lrm(Y ~ X, penalty=penlty,
penalty.matrix=penalty.matrix, x=TRUE,y=TRUE)
val <- validate(f, method="boot", B=20) # use larger B in practice
index[i,] <- val[,"index.corrected"]
}
par(mfrow=c(3,3))
for(i in 1:9)
{
plot(penalties, index[,i],
xlab="Penalty", ylab=dimnames(index)[[2]][i])
lines(lowess(penalties, index[,i]))
}
options(datadist=NULL)
# Example of weighted analysis
x <- 1:5
y <- c(0,1,0,1,0)
reps <- c(1,2,3,2,1)
lrm(y ~ x, weights=reps)
#> Warning: currently weights are ignored in model validation and bootstrapping lrm fits
#> Logistic Regression Model
#>
#> lrm(formula = y ~ x, weights = reps)
#>
#>
#> Sum of Weights by Response Category
#>
#> 0 1
#> 5 4
#>
#> Model Likelihood Discrimination Rank Discrim.
#> Ratio Test Indexes Indexes
#> Obs 5 LR chi2 0.00 R2 0.000 C 0.500
#> 0 3 d.f. 1 R2(1,9) 0.000 Dxy 0.000
#> 1 2 Pr(> chi2) 1.0000 R2(1,6.7) 0.000 tau-a 0.000
#> Sum of weights 9 Brier 0.247
#> max |deriv| 0
#>
#> Coef S.E. Wald Z Pr(>|Z|)
#> Intercept -0.2231 1.8675 -0.12 0.9049
#> x 0.0000 0.5809 0.00 1.0000
#>
x <- rep(x, reps)
y <- rep(y, reps)
lrm(y ~ x) # same as above
#> Logistic Regression Model
#>
#> lrm(formula = y ~ x)
#>
#> Model Likelihood Discrimination Rank Discrim.
#> Ratio Test Indexes Indexes
#> Obs 9 LR chi2 0.00 R2 0.000 C 0.500
#> 0 5 d.f. 1 R2(1,9) 0.000 Dxy 0.000
#> 1 4 Pr(> chi2) 1.0000 R2(1,6.7) 0.000 gamma 0.000
#> max |deriv| 9e-16 Brier 0.247 tau-a 0.000
#>
#> Coef S.E. Wald Z Pr(>|Z|)
#> Intercept -0.2231 1.8675 -0.12 0.9049
#> x 0.0000 0.5809 0.00 1.0000
#>
#
#Study performance of a modified AIC which uses the effective d.f.
#See Verweij and Van Houwelingen (1994) Eq. (6). Here AIC=chisq-2*df.
#Also try as effective d.f. equation (4) of the previous reference.
#Also study performance of Shao's cross-validation technique (which was
#designed to pick the "right" set of variables, and uses a much smaller
#training sample than most methods). Compare cross-validated deviance
#vs. penalty to the gold standard accuracy on a 7500 observation dataset.
#Note that if you only want to get AIC or Schwarz Bayesian information
#criterion, all you need is to invoke the pentrace function.
#NOTE: the effective.df( ) function is used in practice
#
if (FALSE) { # \dontrun{
for(seed in c(339,777,22,111,3)){
# study performance for several datasets
set.seed(seed)
n <- 175; p <- 8
X <- matrix(rnorm(n*p), ncol=p) # p normal(0,1) predictors
Coef <- c(-.1,.2,-.3,.4,-.5,.6,-.65,.7) # true population coefficients
L <- X %*% Coef # intercept is zero
Y <- ifelse(runif(n)<=plogis(L), 1, 0)
pm <- diag(diag(var(X)))
#Generate a large validation sample to use as a gold standard
n.val <- 7500
X.val <- matrix(rnorm(n.val*p), ncol=p)
L.val <- X.val %*% Coef
Y.val <- ifelse(runif(n.val)<=plogis(L.val), 1, 0)
#
Penalty <- seq(0,30,by=1)
reps <- length(Penalty)
effective.df <- effective.df2 <- aic <- aic2 <- deviance.val <-
Lpenalty <- single(reps)
n.t <- round(n^.75)
ncv <- c(10,20,30,40) # try various no. of reps in cross-val.
deviance <- matrix(NA,nrow=reps,ncol=length(ncv))
#If model were complex, could have started things off by getting X, Y
#penalty.matrix from an initial lrm fit to save time
#
for(i in 1:reps) {
pen <- Penalty[i]
cat(format(pen),"")
f.full <- lrm.fit(X, Y, penalty.matrix=pen*pm)
Lpenalty[i] <- pen* t(f.full$coef[-1]) %*% pm %*% f.full$coef[-1]
f.full.nopenalty <- lrm.fit(X, Y, initial=f.full$coef, maxit=1)
info.matrix.unpenalized <- solve(f.full.nopenalty$var)
effective.df[i] <- sum(diag(info.matrix.unpenalized %*% f.full$var)) - 1
lrchisq <- f.full.nopenalty$stats["Model L.R."]
# lrm does all this penalty adjustment automatically (for var, d.f.,
# chi-square)
aic[i] <- lrchisq - 2*effective.df[i]
#
pred <- plogis(f.full$linear.predictors)
score.matrix <- cbind(1,X) * (Y - pred)
sum.u.uprime <- t(score.matrix) %*% score.matrix
effective.df2[i] <- sum(diag(f.full$var %*% sum.u.uprime))
aic2[i] <- lrchisq - 2*effective.df2[i]
#
#Shao suggested averaging 2*n cross-validations, but let's do only 40
#and stop along the way to see if fewer is OK
dev <- 0
for(j in 1:max(ncv)) {
s <- sample(1:n, n.t)
cof <- lrm.fit(X[s,],Y[s],
penalty.matrix=pen*pm)$coef
pred <- cof[1] + (X[-s,] %*% cof[-1])
dev <- dev -2*sum(Y[-s]*pred + log(1-plogis(pred)))
for(k in 1:length(ncv)) if(j==ncv[k]) deviance[i,k] <- dev/j
}
#
pred.val <- f.full$coef[1] + (X.val %*% f.full$coef[-1])
prob.val <- plogis(pred.val)
deviance.val[i] <- -2*sum(Y.val*pred.val + log(1-prob.val))
}
postscript(hor=TRUE) # along with graphics.off() below, allow plots
par(mfrow=c(2,4)) # to be printed as they are finished
plot(Penalty, effective.df, type="l")
lines(Penalty, effective.df2, lty=2)
plot(Penalty, Lpenalty, type="l")
title("Penalty on -2 log L")
plot(Penalty, aic, type="l")
lines(Penalty, aic2, lty=2)
for(k in 1:length(ncv)) {
plot(Penalty, deviance[,k], ylab="deviance")
title(paste(ncv[k],"reps"))
lines(supsmu(Penalty, deviance[,k]))
}
plot(Penalty, deviance.val, type="l")
title("Gold Standard (n=7500)")
title(sub=format(seed),adj=1,cex=.5)
graphics.off()
}
} # }
#The results showed that to obtain a clear picture of the penalty-
#accuracy relationship one needs 30 or 40 reps in the cross-validation.
#For 4 of 5 samples, though, the super smoother was able to detect
#an accurate penalty giving the best (lowest) deviance using 10-fold
#cross-validation. Cross-validation would have worked better had
#the same splits been used for all penalties.
#The AIC methods worked just as well and are much quicker to compute.
#The first AIC based on the effective d.f. in Gray's Eq. 2.9
#(Verweij and Van Houwelingen (1994) Eq. 5 (note typo)) worked best.