Plot Effects of Variables Estimated by a Regression Model Fit Using ggplot2
ggplot.Predict.RdUses ggplot2 graphics to plot the effect of one or two predictors
on the linear predictor or X beta scale, or on some transformation of
that scale. The first argument specifies the result of the
Predict function. The predictor is always plotted in its
original coding.
If rdata is given, a spike histogram is drawn showing
the location/density of data values for the \(x\)-axis variable. If
there is a groups (superposition) variable that generated separate
curves, the data density specific to each class of points is shown.
This assumes that the second variable was a factor variable. The histograms
are drawn by histSpikeg.
To plot effects instead of estimates (e.g., treatment differences as a
function of interacting factors) see contrast.rms and
summary.rms.
Usage
# S3 method for class 'Predict'
ggplot(data, mapping, formula=NULL, groups=NULL,
aestype=c('color', 'linetype'),
conf=c('fill', 'lines'),
conflinetype=1,
varypred=FALSE, sepdiscrete=c('no', 'list', 'vertical', 'horizontal'),
subset, xlim., ylim., xlab, ylab,
colorscale=function(...) scale_color_manual(...,
values=c("#000000", "#E69F00", "#56B4E9",
"#009E73","#F0E442", "#0072B2", "#D55E00", "#CC79A7")),
colfill='black',
rdata=NULL, anova=NULL, pval=FALSE, size.anova=4,
adj.subtitle, size.adj=2.5, perim=NULL, nlevels=3,
flipxdiscrete=TRUE,
legend.position='right', legend.label=NULL,
vnames=c('labels','names'), abbrev=FALSE, minlength=6,
layout=NULL, addlayer,
histSpike.opts=list(frac=function(f) 0.01 +
0.02 * sqrt(f - 1)/sqrt(max(f, 2) - 1), side=1, nint=100),
type=NULL, ggexpr=FALSE, height=NULL, width=NULL, ..., environment)Arguments
- data
a data frame created by
Predict- mapping
kept because of
ggplotgeneric setup. If specified it will be assumed to beformula.- formula
a
ggplotfaceting formula of the formvertical variables ~ horizontal variables, with variables separated by*if there is more than one variable on a side. If omitted, the formula will be built using assumptions on the list of variables that varied in thePredictcall. When plotting multiple panels (for separate predictors),formulamay be specified but by default no formula is constructed.- groups
an optional character string containing the name of one of the variables in
datathat is to be used as a grouping (superpositioning) variable. Setgroups=FALSEto suppress superpositioning. By default, the second varying variable is used for superpositioning groups. You can also specify a length 2 string vector of variable names specifying two dimensions of superpositioning, identified by different aesthetics corresponding to theaestypeargument. When plotting effects of more than one predictor,groupsis a character string that specifies a single variable name indatathat can be used to form panels. Only applies if usingrbindto combine severalPredictresults. If there is more than onegroupsvariable, confidence bands are suppressed becauseggplot2:geom_ribbondoes not handle the aesthetics correctly.- aestype
a string vector of aesthetic names corresponding to variables in the
groupsvector. Default is to use, in order,color, andlinetype. Other permissible values aresize,shape.- conf
specify
conf="line"to show confidence bands with lines instead of filled ribbons, the default- conflinetype
specify an alternative
linetypefor confidence intervals ifconf="line"- varypred
set to
TRUEifdatais the result of passing multiplePredictresults, that represent different predictors, torbind.Predict. This will cause the.set.variable created byrbindto be copied to the.predictor.variable.- sepdiscrete
set to something other than
"no"to create separate graphics for continuous and discrete predictors. For discrete predictors, horizontal dot charts are produced. This allows use of theggplot2facet_wrapfunction to make better use of space. Ifsepdiscrete="list", a list of twogridgraphics objects is returned if both types of predictors are present (otherwise one object for the type that existed in the model). Setsepdiscrete="vertical"to put the two types of plots into one graphical object with continuous predictors on top and given a fraction of space relative to the number of continuous vs. number of discrete variables. Setsepdiscrete="horizontal"to get a horizontal arrangements with continuous variables on the left.- subset
a subsetting expression for restricting the rows of
datathat are used in plotting. For example, predictions may have been requested for males and females but one wants to plot only females.- xlim.
This parameter is seldom used, as limits are usually controlled with
Predict. Usually given as its legal abbreviationxlim. One reason to usexlimis to plot afactorvariable on the x-axis that was created with thecut2function with thelevels.meanoption, withval.lev=TRUEspecified toplot.Predict. In this case you may want the axis to have the range of the original variable values given tocut2rather than the range of the means within quantile groups.- ylim.
Range for plotting on response variable axis. Computed by default. Usually specified using its legal definition
ylim.- xlab
Label for
x-axis. Default is one given toasis, rcs, etc., which may have been the"label"attribute of the variable.- ylab
Label for
y-axis. Iffunis not given, default is"log Odds"forlrm,"log Relative Hazard"forcph, name of the response variable forols,TRUEorlog(TRUE)forpsm, or"X * Beta"otherwise. Specifyylab=NULLto omity-axis labels.- colorscale
a
ggplot2discrete scale function, e.g.function(...) scale_color_brewer(..., palette='Set1', type='qual'). The default is the colorblind-friendly palette including black in http://www.cookbook-r.com/Graphs/Colors_(ggplot2). If you get an error "insufficient values in manual scale", which occurs when there are more than 8 groups, just specifycolorscale=function(...){}to use default colors.- colfill
a single character string or number specifying the fill color to use for
geom_ribbonfor shaded confidence bands. Alpha transparency of 0.2 is applied to any color specified.- rdata
a data frame containing the original raw data on which the regression model were based, or at least containing the \(x\)-axis and grouping variable. If
rdatais present and contains the needed variables, the original data are added to the graph in the form of a spike histogram usinghistSpikegin the Hmisc package.- anova
an object returned by
anova.rms. Ifanovais specified, the overall test of association for predictor plotted is added as text to each panel, located at the spot at which the panel is most empty unless there is significant empty space at the top or bottom of the panel; these areas are given preference.- pval
specify
pval=TRUEforanovato include not only the test statistic but also the P-value- size.anova
character size for the test statistic printed on the panel, mm
- adj.subtitle
Set to
FALSEto suppress subtitling the graph with the list of settings of non-graphed adjustment values. Subtitles appear as captions withggplot2usinglabs(caption=).- size.adj
Size of adjustment settings in subtitles in mm. Default is 2.5.
- perim
perimspecifies a function having two arguments. The first is the vector of values of the first variable that is about to be plotted on the x-axis. The second argument is the single value of the variable representing different curves, for the current curve being plotted. The function's returned value must be a logical vector whose length is the same as that of the first argument, with valuesTRUEif the corresponding point should be plotted for the current curve,FALSEotherwise. See one of the latter examples.perimonly applies if predictors were specified toPredict.- nlevels
when
groupsandformulaare not specified, if any panel variable hasnlevelsor fewer values, that variable is converted to agroups(superpositioning) variable. Setnlevels=0to prevent this behavior. For other situations, a non-numeric x-axis variable withnlevelsor fewer unique values will cause a horizontal dot plot to be drawn instead of an x-y plot unlessflipxdiscrete=FALSE.- flipxdiscrete
see
nlevels- legend.position
"right"(the default for single-panel plots),"left","bottom","top", a two-element numeric vector, or"none"to suppress. For multi-panel plots the default is"top", and a legend only appears for the first (top left) panel.- legend.label
if omitted, group variable labels will be used for label the legend. Specify
legend.label=FALSEto suppress using a legend name, or a character string or expression to specify the label. Can be a vector is there is more than one grouping variable.- vnames
applies to the case where multiple plots are produced separately by predictor. Set to
'names'to use variable names instead of labels for these small plots.- abbrev
set to true to abbreviate levels of predictors that are categorical to a minimum length of
minlength- minlength
see
abbrev- layout
for multi-panel plots a 2-vector specifying the number of rows and number of columns. If omitted will be computed from the number of panels to make as square as possible.
- addlayer
a
ggplot2expression consisting of one or more layers to add to the current plot- histSpike.opts
a list containing named elements that specifies parameters to
histSpikegwhenrdatais given. Thecolparameter is usually derived from other plotting information and not specified by the user.- type
a value (
"l","p","b") to override default choices related to showing or connecting points. Especially useful for discrete x coordinate variables.- ggexpr
set to
TRUEto have the function return the character string(s) constructed to invokeggplotwithout executing the commands- height,width
used if
plotlyis in effect, to specify theplotlyimage in pixels. Default is to letplotlysize the image.- ...
ignored
- environment
ignored; used to satisfy rules because of the generic ggplot
Value
an object of class "ggplot2" ready for printing. For the
case where predictors were not specified to Predict,
sepdiscrete=TRUE, and there were both continuous and discrete
predictors in the model, a list of two graphics objects is returned.
References
Fox J, Hong J (2009): Effect displays in R for multinomial and proportional-odds logit models: Extensions to the effects package. J Stat Software 32 No. 1.
Note
If plotting the effects of all predictors you can reorder the
panels using for example p <- Predict(fit); p$.predictor. <-
factor(p$.predictor., v) where v is a vector of predictor
names specified in the desired order.
See also
Predict, rbind.Predict,
datadist, predictrms, anova.rms,
contrast.rms, summary.rms,
rms, rmsMisc, plot.Predict,
labcurve, histSpikeg,
ggplot, Overview
Examples
require(ggplot2)
n <- 350 # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
label(age) <- 'Age' # label is in Hmisc
label(cholesterol) <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex) <- 'Sex'
units(cholesterol) <- 'mg/dl' # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'
# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
(log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male')) +
.01 * (blood.pressure - 120)
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
fit <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
x=TRUE, y=TRUE)
#> Error in Design(data, formula = formula): dataset ddist not found for options(datadist=)
an <- anova(fit)
#> Error: object 'fit' not found
# Plot effects in two vertical sub-panels with continuous predictors on top
# ggplot(Predict(fit), sepdiscrete='vertical')
# Plot effects of all 4 predictors with test statistics from anova, and P
ggplot(Predict(fit), anova=an, pval=TRUE)
#> Error: object 'fit' not found
# ggplot(Predict(fit), rdata=llist(blood.pressure, age))
# spike histogram plot for two of the predictors
# p <- Predict(fit, name=c('age','cholesterol')) # Make 2 plots
# ggplot(p)
# p <- Predict(fit, age=seq(20,80,length=100), sex, conf.int=FALSE)
# # Plot relationship between age and log
# odds, separate curve for each sex,
# ggplot(p, subset=sex=='female' | age > 30)
# No confidence interval, suppress estimates for males <= 30
# p <- Predict(fit, age, sex)
# ggplot(p, rdata=llist(age,sex))
# rdata= allows rug plots (1-dimensional scatterplots)
# on each sex's curve, with sex-
# specific density of age
# If data were in data frame could have used that
# p <- Predict(fit, age=seq(20,80,length=100), sex='male', fun=plogis)
# works if datadist not used
# ggplot(p, ylab=expression(hat(P)))
# plot predicted probability in place of log odds
# per <- function(x, y) x >= 30
# ggplot(p, perim=per) # suppress output for age < 30 but leave scale alone
# Do ggplot2 faceting a few different ways
p <- Predict(fit, age, sex, blood.pressure=c(120,140,160),
cholesterol=c(180,200,215))
#> Error: object 'fit' not found
# ggplot(p)
ggplot(p, cholesterol ~ blood.pressure)
#> Error: object 'p' not found
# ggplot(p, ~ cholesterol + blood.pressure)
# color for sex, line type for blood.pressure:
ggplot(p, groups=c('sex', 'blood.pressure'))
#> Error: object 'p' not found
# Add legend.position='top' to allow wider plot
# Map blood.pressure to line thickness instead of line type:
# ggplot(p, groups=c('sex', 'blood.pressure'), aestype=c('color', 'size'))
# Plot the age effect as an odds ratio
# comparing the age shown on the x-axis to age=30 years
# ddist$limits$age[2] <- 30 # make 30 the reference value for age
# Could also do: ddist$limits["Adjust to","age"] <- 30
# fit <- update(fit) # make new reference value take effect
# p <- Predict(fit, age, ref.zero=TRUE, fun=exp)
# ggplot(p, ylab='Age=x:Age=30 Odds Ratio',
# addlayer=geom_hline(yintercept=1, col=gray(.8)) +
# geom_vline(xintercept=30, col=gray(.8)) +
# scale_y_continuous(trans='log',
# breaks=c(.5, 1, 2, 4, 8))))
# Compute predictions for three predictors, with superpositioning or
# conditioning on sex, combined into one graph
p1 <- Predict(fit, age, sex)
#> Error: object 'fit' not found
p2 <- Predict(fit, cholesterol, sex)
#> Error: object 'fit' not found
p3 <- Predict(fit, blood.pressure, sex)
#> Error: object 'fit' not found
p <- rbind(age=p1, cholesterol=p2, blood.pressure=p3)
#> Error: object 'p1' not found
ggplot(p, groups='sex', varypred=TRUE, adj.subtitle=FALSE)
#> Error: object 'p' not found
# ggplot(p, groups='sex', varypred=TRUE, adj.subtitle=FALSE, sepdiscrete='vert')
if (FALSE) { # \dontrun{
# For males at the median blood pressure and cholesterol, plot 3 types
# of confidence intervals for the probability on one plot, for varying age
ages <- seq(20, 80, length=100)
p1 <- Predict(fit, age=ages, sex='male', fun=plogis) # standard pointwise
p2 <- Predict(fit, age=ages, sex='male', fun=plogis,
conf.type='simultaneous') # simultaneous
p3 <- Predict(fit, age=c(60,65,70), sex='male', fun=plogis,
conf.type='simultaneous') # simultaneous 3 pts
# The previous only adjusts for a multiplicity of 3 points instead of 100
f <- update(fit, x=TRUE, y=TRUE)
g <- bootcov(f, B=500, coef.reps=TRUE)
p4 <- Predict(g, age=ages, sex='male', fun=plogis) # bootstrap percentile
p <- rbind(Pointwise=p1, 'Simultaneous 100 ages'=p2,
'Simultaneous 3 ages'=p3, 'Bootstrap nonparametric'=p4)
# as.data.frame so will call built-in ggplot
ggplot(as.data.frame(p), aes(x=age, y=yhat)) + geom_line() +
geom_ribbon(data=p, aes(ymin=lower, ymax=upper), alpha=0.2, linetype=0)+
facet_wrap(~ .set., ncol=2)
# Plots for a parametric survival model
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n,
rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
t <- -log(runif(n))/h
label(t) <- 'Follow-up Time'
e <- ifelse(t<=cens,1,0)
t <- pmin(t, cens)
units(t) <- "Year"
ddist <- datadist(age, sex)
require(survival)
Srv <- Surv(t,e)
# Fit log-normal survival model and plot median survival time vs. age
f <- psm(Srv ~ rcs(age), dist='lognormal')
med <- Quantile(f) # Creates function to compute quantiles
# (median by default)
p <- Predict(f, age, fun=function(x) med(lp=x))
ggplot(p, ylab="Median Survival Time")
# Note: confidence intervals from this method are approximate since
# they don't take into account estimation of scale parameter
# Fit an ols model to log(y) and plot the relationship between x1
# and the predicted mean(y) on the original scale without assuming
# normality of residuals; use the smearing estimator
# See help file for rbind.Predict for a method of showing two
# types of confidence intervals simultaneously.
# Add raw data scatterplot to graph
set.seed(1)
x1 <- runif(300)
x2 <- runif(300)
ddist <- datadist(x1, x2); options(datadist='ddist')
y <- exp(x1 + x2 - 1 + rnorm(300))
f <- ols(log(y) ~ pol(x1,2) + x2)
r <- resid(f)
smean <- function(yhat)smearingEst(yhat, exp, res, statistic='mean')
formals(smean) <- list(yhat=numeric(0), res=r[! is.na(r)])
#smean$res <- r[! is.na(r)] # define default res argument to function
ggplot(Predict(f, x1, fun=smean), ylab='Predicted Mean on y-scale',
addlayer=geom_point(aes(x=x1, y=y), data.frame(x1, y)))
# Had ggplot not added a subtitle (i.e., if x2 were not present), you
# could have done ggplot(Predict(), ylab=...) + geom_point(...)
} # }
# Make an 'interaction plot', forcing the x-axis variable to be
# plotted at integer values but labeled with category levels
n <- 100
set.seed(1)
gender <- c(rep('male', n), rep('female',n))
m <- sample(c('a','b'), 2*n, TRUE)
d <- datadist(gender, m); options(datadist='d')
anxiety <- runif(2*n) + .2*(gender=='female') + .4*(gender=='female' & m=='b')
tapply(anxiety, llist(gender,m), mean)
#> m
#> gender a b
#> female 0.68977 1.11430
#> male 0.39019 0.48358
f <- ols(anxiety ~ gender*m)
#> Error in Design(X, formula = formula): dataset d not found for options(datadist=)
p <- Predict(f, gender, m)
#> Error: object 'f' not found
# ggplot(p) # horizontal dot chart; usually preferred for categorical predictors
# ggplot(p, flipxdiscrete=FALSE) # back to vertical
ggplot(p, groups='gender')
#> Error: object 'p' not found
ggplot(p, ~ m, groups=FALSE, flipxdiscrete=FALSE)
#> Error: object 'p' not found
options(datadist=NULL)
if (FALSE) { # \dontrun{
# Example in which separate curves are shown for 4 income values
# For each curve the estimated percentage of voters voting for
# the democratic party is plotted against the percent of voters
# who graduated from college. Data are county-level percents.
incomes <- seq(22900, 32800, length=4)
# equally spaced to outer quintiles
p <- Predict(f, college, income=incomes, conf.int=FALSE)
ggplot(p, xlim=c(0,35), ylim=c(30,55))
# Erase end portions of each curve where there are fewer than 10 counties having
# percent of college graduates to the left of the x-coordinate being plotted,
# for the subset of counties having median family income with 1650
# of the target income for the curve
show.pts <- function(college.pts, income.pt) {
s <- abs(income - income.pt) < 1650 #assumes income known to top frame
x <- college[s]
x <- sort(x[!is.na(x)])
n <- length(x)
low <- x[10]; high <- x[n-9]
college.pts >= low & college.pts <= high
}
ggplot(p, xlim=c(0,35), ylim=c(30,55), perim=show.pts)
# Rename variables for better plotting of a long list of predictors
f <- ...
p <- Predict(f)
re <- c(trt='treatment', diabet='diabetes', sbp='systolic blood pressure')
for(n in names(re)) {
names(p)[names(p)==n] <- re[n]
p$.predictor.[p$.predictor.==n] <- re[n]
}
ggplot(p)
} # }