Concise Statistical Description of a Vector, Matrix, Data Frame, or Formula
describe is a generic method that invokes describe.data.frame,
describe.matrix, describe.vector, or
describe.formula. describe.vector is the basic
function for handling a single variable.
This function determines whether the variable is character, factor,
category, binary, discrete numeric, or continuous numeric, and prints
a concise statistical summary according to each. A numeric variable is
deemed discrete if it has <= 10 distinct values. In this case,
quantiles are not printed. A frequency table is printed
for any non-binary variable if it has no more than 20 distinct
values. For any variable for which the frequency table is not printed,
the 5 lowest and highest values are printed. This behavior can be
overridden for long character variables with many levels using the
listunique parameter, to get a complete tabulation.
describe is especially useful for
describing data frames created by *.get, as labels, formats,
value labels, and (in the case of sas.get) frequencies of special
missing values are printed.
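The listunique argument documentation below notes that, when character strings are tabulated, runs of white space are collapsed to a single space, leading and trailing white space is ignored, and case is ignored. A minimal base R sketch of that kind of normalization follows. The helper name normalize_strings is hypothetical, and this is not Hmisc's internal code.

```r
# Hypothetical helper, not part of Hmisc: normalize strings before tabulating
normalize_strings <- function(s) {
  s <- gsub("[[:space:]]+", " ", s)  # collapse any run of white space to one space
  s <- trimws(s)                     # drop leading and trailing white space
  tolower(s)                         # ignore case
}

comments <- c("Chest  pain", " chest pain", "CHEST\tPAIN ", "Nausea")
tab <- table(normalize_strings(comments))
print(tab)
```

After normalization the first three strings fall into the same category, so the table has two entries rather than four.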
For a binary variable, the sum (number of 1's) and mean (proportion of
1's) are printed. If the first argument is a formula, a model frame
is created and passed to describe.data.frame. If a variable
is of class "impute", a count of the number of imputed values is
printed. If a date variable has an attribute partial.date
(this is set up by sas.get), counts of how many partial dates are
actually present (missing month, missing day, missing both) are also presented.
If a variable was created by the special-purpose function substi (which
substitutes values of a second variable if the first variable is NA),
the frequency table of substitutions is also printed.
For numeric variables, describe adds an item called Info
which is a relative information measure using the relative efficiency of
a proportional odds/Wilcoxon test on the variable relative to the same
test on a variable that has no ties. Info is related to how
continuous the variable is, and ties are less harmful the more untied
values there are. The formula for Info is one minus the sum of
the cubes of relative frequencies of values divided by one minus the
square of the reciprocal of the sample size. The lowest information
comes from a variable having only one distinct value followed by a
highly skewed binary variable. Info is reported to
two decimal places.
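The verbal formula for Info can be written out directly in base R. The sketch below is only an illustration of that formula; the function name info_measure is hypothetical and the code is not taken from the package.

```r
# Hypothetical helper implementing the verbal formula for Info given above:
# (1 - sum of cubed relative frequencies) / (1 - 1/n^2)
info_measure <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  p <- as.numeric(table(x)) / n   # relative frequency of each distinct value
  (1 - sum(p^3)) / (1 - 1 / n^2)
}

info_untied   <- info_measure(1:50)                     # no ties
info_balanced <- info_measure(rep(c(0, 1), each = 25))  # balanced binary
info_skewed   <- info_measure(c(rep(0, 49), 1))         # highly skewed binary
info_constant <- info_measure(rep(7, 50))               # one distinct value
print(c(info_untied, info_balanced, info_skewed, info_constant))
```

With no ties every relative frequency is 1/n, so the numerator equals the denominator and Info is 1 (up to floating point error). A constant variable gives 0, and the skewed binary variable falls between 0 and the balanced binary variable, matching the ordering described above.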
A latex method exists for converting the describe object to a
LaTeX file. For numeric variables having more than 20 distinct values,
describe saves in its returned object the frequencies of 100
evenly spaced bins running from minimum observed value to the maximum.
When there are 20 or fewer distinct values, the original
values are maintained.
latex and html insert a spike histogram displaying these
frequency counts in the tabular material using the LaTeX picture
environment. For example output see
https://hbiostat.org/doc/rms/book/chapter7edition1.pdf.
Note that the latex method assumes you have the following styles
installed in your latex installation: setspace and relsize.
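To make the binning behind the spike histograms concrete, here is a minimal base R sketch that tabulates a numeric vector into 100 evenly spaced bins spanning its observed range and returns range and count elements like those described for intervalFreq in the Value section. The function name interval_freq is hypothetical, and describe's own implementation may differ in details such as rounding and tail handling (see lumptails).

```r
# Hypothetical helper, not part of Hmisc: frequencies in evenly spaced bins
interval_freq <- function(x, nbins = 100) {
  x <- x[!is.na(x)]
  r <- range(x)
  # nbins + 1 break points; the end points are set exactly to the min and max
  breaks <- c(r[1], r[1] + seq_len(nbins - 1) * (r[2] - r[1]) / nbins, r[2])
  # include.lowest = TRUE keeps the minimum value in the first bin
  bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  list(range = r, count = tabulate(bin, nbins = nbins))
}

set.seed(1)
x <- rnorm(1000)
z <- interval_freq(x)
print(length(z$count))  # number of bins
print(sum(z$count))     # every observation lands in some bin
```

The sketch assumes the variable has more than one distinct value; with a constant vector the break points would coincide and cut would fail.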
The html method mimics the LaTeX output. This is useful in the
context of Quarto/Rmarkdown html and html notebook output.
If options(prType='html') is in effect, calling print on
an object that is the result of running describe on a data frame
will result in rendering the HTML version. If run from the console a
browser window will open. When which is specified to
print, whether or not prType='html' is in effect, a
gt package html table will be produced containing only
the types of variables requested. When which='both' a list with
element names Continuous and Categorical is produced,
making it convenient for the user to print as desired, or to pass the
list directly to the qreport maketabs function when using Quarto.
The plot method is for describe objects run on data
frames. It produces spike histograms for a graphic of
continuous variables and a dot chart for categorical variables, showing
category proportions. The graphic format is ggplot2 if the user
has not set options(grType='plotly') or has set the grType
option to something other than 'plotly'. Otherwise plotly
graphics that are interactive are produced, and these can be placed into
an Rmarkdown html notebook. The user must install the plotly
package for this to work. When the user hovers the mouse over a bin for
a raw data value, the actual value will pop up (formatted using
digits). When the user hovers over the minimum data value, most
of the information calculated by describe will pop up. For each
variable, the number of missing values is used to assign the color to
the histogram or dot chart, and a legend is drawn. Color is not used if
there are no missing values in any variable. For categorical variables,
hovering over the leftmost point for a variable displays details, and
for all points proportions, numerators, and denominators are displayed
in the popup. If both continuous and categorical variables are present
and which='both' is specified, the plot method returns an
unclassed list containing two objects, named 'Categorical'
and 'Continuous', in that order.
Sample weights may be specified to any of the functions, resulting in weighted means, quantiles, and frequency tables.
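As a small illustration of what weighting means here, the following base R sketch computes a weighted mean and a weighted frequency table by hand, and shows the rescaling that the normwt argument describes. The variable names are made up for the example, and describe itself may compute these quantities differently internally.

```r
x <- c(1, 2, 2, 5)
w <- c(2, 1, 1, 4)   # frequency weights: the value 5 counts four times

wmean <- sum(w * x) / sum(w)   # same result as weighted.mean(x, w)
wfreq <- tapply(w, x, sum)     # weighted frequency of each distinct value
wprop <- wfreq / sum(w)        # weighted proportions

# Analogue of normwt=TRUE: rescale so the weights sum to the number of records
w_norm <- w * length(x) / sum(w)

print(wmean)
print(wfreq)
print(sum(w_norm))
```

With normwt=FALSE the effective sample size is sum(w), which is 8 in this example; after rescaling it is the number of records, 4.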
Note: As discussed in Cox and Longton (2008), Stata Journal 8(4), p. 557, the term "unique" has been replaced with "distinct" in the output (but not in parameter names).
When weights are not used, the pseudomedian and Gini's mean difference are computed for
numeric variables. The pseudomedian is labeled pMedian and is the median of all possible pairwise averages. It is a robust and efficient measure of location that equals the mean and median for symmetric distributions. It is also called the Hodges-Lehmann one-sample estimator. Gini's mean difference is a robust measure of dispersion that is the
mean absolute difference between all pairs of observations. In simple
output Gini's mean difference is labeled Gmd.
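Both statistics can be computed by brute force in a few lines of base R, which may help clarify the definitions. The function names below are hypothetical. The pseudomedian is taken here over all pairs with i <= j (the usual Hodges-Lehmann convention, which is an assumption about the exact definition used), and Gini's mean difference over all pairs of distinct observations. The approach uses memory proportional to the square of the sample size, so it is only suitable for small vectors.

```r
# Hypothetical helpers, not the package's implementation
pseudomedian <- function(x) {
  x <- x[!is.na(x)]
  w <- outer(x, x, "+") / 2             # all pairwise averages
  median(w[upper.tri(w, diag = TRUE)])  # keep pairs with i <= j
}

gini_md <- function(x) {
  x <- x[!is.na(x)]
  d <- abs(outer(x, x, "-"))            # all pairwise absolute differences
  mean(d[upper.tri(d)])                 # keep pairs with i < j
}

x <- c(1, 2, 3, 10)
print(pseudomedian(x))
print(gini_md(x))
print(pseudomedian(c(1, 2, 3)))  # symmetric data: equals the mean and median
```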
formatdescribeSingle is a service function for latex,
html, and print methods for single variables that is not
intended to be called by the user.
Usage
# S3 method for class 'vector'
describe(x, descript, exclude.missing=TRUE, digits=4,
listunique=0, listnchar=12,
weights=NULL, normwt=FALSE, minlength=NULL, shortmChoice=TRUE,
rmhtml=FALSE, trans=NULL, lumptails=0.01, ...)
# S3 method for class 'matrix'
describe(x, descript, exclude.missing=TRUE, digits=4, ...)
# S3 method for class 'data.frame'
describe(x, descript, exclude.missing=TRUE,
digits=4, trans=NULL, ...)
# S3 method for class 'formula'
describe(x, descript, data, subset, na.action,
digits=4, weights, ...)
# S3 method for class 'describe'
print(x, which = c('both', 'categorical', 'continuous'), ...)
# S3 method for class 'describe'
latex(object, title=NULL,
file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
append=FALSE, size='small', tabular=TRUE, greek=TRUE,
spacing=0.7, lspace=c(0,0), ...)
# S3 method for class 'describe.single'
latex(object, title=NULL, vname,
file, append=FALSE, size='small', tabular=TRUE, greek=TRUE,
lspace=c(0,0), ...)
# S3 method for class 'describe'
html(object, size=85, tabular=TRUE,
greek=TRUE, scroll=FALSE, rows=25, cols=100, ...)
# S3 method for class 'describe.single'
html(object, size=85,
tabular=TRUE, greek=TRUE, ...)
formatdescribeSingle(x, condense=c('extremes', 'frequencies', 'both', 'none'),
lang=c('plain', 'latex', 'html'), verb=0, lspace=c(0, 0),
size=85, ...)
# S3 method for class 'describe'
plot(x, which=c('both', 'continuous', 'categorical'),
what=NULL,
sort=c('ascending', 'descending', 'none'),
n.unique=10, digits=5, bvspace=2, ...)
Arguments
- x
a data frame, matrix, vector, or formula. For a data frame, the describe.data.frame function is automatically invoked. For a matrix, describe.matrix is called. For a formula, describe.data.frame(model.frame(x)) is invoked. The formula may or may not have a response variable. For print, latex, html, or formatdescribeSingle, x is an object created by describe.
- descript
optional title to print for x. The default is the name of the argument or the "label" attributes of individual variables. When the first argument is a formula, descript defaults to a character representation of the formula.
- exclude.missing
set to TRUE to print the names of variables that contain only missing values. This list appears at the bottom of the printout, and no space is taken up for such variables in the main listing.
- digits
number of significant digits to print. For plot.describe, this is the number of significant digits to put in hover text for plotly when showing raw variable values.
- listunique
For a character variable that is not an mChoice variable, that has its longest string length greater than listnchar, and that has no more than listunique distinct values, all values are listed in alphabetic order. Any value having more than one occurrence has the frequency of occurrence included. Specify listunique equal to some value at least as large as the number of observations to ensure that all character variables will have all their values listed. For purposes of tabulating character strings, multiple white spaces of any kind are translated to a single space, leading and trailing white space are ignored, and case is ignored.
- listnchar
see listunique
- weights
a numeric vector of frequencies or sample weights. Each observation will be treated as if it were sampled weights times.
- minlength
value passed to summary.mChoice
- shortmChoice
set to FALSE to have summary of mChoice variables use actual levels everywhere, instead of abbreviating to integers and printing of all original labels at the top
- rmhtml
set to TRUE to strip html from variable labels
- trans
for describe.vector is a list specifying how to transform x for constructing the frequency distribution used in spike histograms. The first element of the list is a character string describing the transformation, the second is the transformation function, and the third element is the inverse of this function that is used in labeling points on the original scale, e.g. trans=list('log', log, exp). For describe.data.frame, trans is a list of such lists, with the name of each list being the name of the variable to which the transformation applies. See https://hbiostat.org/rmsc/impred.html#data for an example.
- lumptails
specifies the quantile to use (its complement is also used) for grouping observations in the tails so that outliers have less chance of distorting the variable's range for sparkline spike histograms. The default is 0.01, i.e., observations below the 0.01 quantile are grouped together in the leftmost bin, and observations above the 0.99 quantile are grouped to form the last bin.
- normwt
The default, normwt=FALSE, results in the use of weights as weights in computing various statistics. In this case the sample size is assumed to be equal to the sum of weights. Specify normwt=TRUE to divide weights by a constant so that weights sum to the number of observations (length of vectors specified to describe). In this case the number of observations is taken to be the actual number of records given to describe.
- object
a result of describe
- title
unused
- data
a data frame, data table, or list
- subset
a subsetting expression
- na.action
These are used if a formula is specified. na.action defaults to na.retain which does not delete any NAs from the data frame. Use na.action=na.omit or na.delete to drop any observation with any NA before processing.
- ...
arguments passed to describe.default which are passed to calls to format for numeric variables. For example if using R POSIXct or Date date/time formats, specifying describe(d,format='%d%b%y') will print date/time variables as "01Jan2000". This is useful for omitting the time component. See the help file for format.POSIXct or format.Date for more information. For plot methods, ... is ignored. For html and latex methods, ... is used to pass optional arguments to formatdescribeSingle, especially the condense argument. For the print method when which= is given, possible arguments to use for tabulating continuous variable output are sparkwidth (the width of the spike histogram sparkline in pixels, defaulting to 200), qcondense (set to FALSE to devote separate columns to all quantiles), extremes (set to TRUE to print the 5 lowest and highest values in the table of continuous variables). For categorical variable output, the argument freq can be used to specify how frequency tables are rendered: 'chart' (the default; an interactive sparkline frequency bar chart) or freq='table' for small tables. sort is another argument passed to html_describe_cat. For sparkline frequency charts the default is to sort non-numeric categories in descending order of frequency. Set code=FALSE to use the original data order. The w argument also applies to categorical variable output.
- file
name of output file (should have a suffix of .tex). Default name is formed from the first word of the descript element of the describe object, prefixed by "describe". Set file="" to send LaTeX code to standard output instead of a file.
- append
set to TRUE to have latex append text to an existing file named file
- size
LaTeX text size ("small", the default, or "normalsize", "tiny", "scriptsize", etc.) for the describe output in LaTeX. For html, this is the percent of the prevailing font size to use for the output.
- tabular
set to FALSE to use verbatim rather than tabular (or html table) environment for the summary statistics output. By default, tabular is used if the output is not too wide.
- greek
By default, the latex and html methods will change names of greek letters that appear in variable labels to appropriate LaTeX symbols in math mode, or html symbols, unless greek=FALSE.
- spacing
By default, the latex method for describe run on a matrix or data frame uses the setspace LaTeX package with a line spacing of 0.7 so as to not waste space. Specify spacing=0 to suppress the use of the setspace package's spacing environment, or specify another positive value to use this environment with a different spacing.
- lspace
extra vertical space, in character size units (i.e., "ex" as appended to the space). When using certain font sizes, there is too much space left around LaTeX verbatim environments. This two-vector specifies space to remove (i.e., the values are negated in forming the vspace command) before (first element) and after (second element of lspace) verbatims
- scroll
set to TRUE to create an html scrollable box for the html output
- rows, cols
the number of rows or columns to allocate for the scrollable box
- vname
unused argument in latex.describe.single
- which
specifies whether to plot numeric continuous or binary/categorical variables, or both. When "both" a list with two elements is created. Each element is a ggplot2 or plotly object. If there are no variables of a given type, a single ggplot2 or plotly object is returned, ready to print. For print.describe, which may be "categorical" or "continuous", causing a gt table to be created with the categorical or continuous variable describe results.
- what
character or numeric vector specifying which variables to plot; default is to plot all
- sort
specifies how and whether variables are sorted in order of the proportion of positives when which="categorical". Specify sort="none" to leave variables in the order they appear in the original data.
- n.unique
the minimum number of distinct values a numeric variable must have before plot.describe uses it in a continuous variable plot
- bvspace
the between-variable spacing for categorical variables. Defaults to 2, meaning twice the amount of vertical space as what is used for between-category spacing within a variable
- condense
specifies whether to condense the output with regard to the 5 lowest and highest values ("extremes") and the frequency table
- lang
specifies the markup language
- verb
set to 1 if a verbatim environment is already in effect for LaTeX
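The tail grouping described for lumptails can be pictured with a short base R sketch. This is one plausible reading of that description, not the package's code: values beyond the chosen quantiles are pulled in to those quantiles before binning, so that extreme observations end up in the first or last bin instead of stretching the range. The function name lump_tails is hypothetical.

```r
# Hypothetical helper: pull values outside the (p, 1 - p) quantiles in to them
lump_tails <- function(x, p = 0.01) {
  q <- quantile(x, c(p, 1 - p), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

set.seed(2)
x <- c(rnorm(999), 1e6)   # one gross outlier
y <- lump_tails(x)
print(range(x))           # range dominated by the outlier
print(range(y))           # range after lumping the tails
```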
Value
a list containing elements descript, counts,
values. The list is of class describe. If the input
object was a matrix or a data
frame, the list is a list of lists, one list for each variable
analyzed. latex returns a standard latex object. For numeric
variables having at least 20 distinct values, an additional component
intervalFreq is included. This component is a list with two elements, range
(containing two values) and count, a vector of 100 integer frequency
counts. print with which= returns a gt table object.
The user can modify the table by piping formatting changes, column
removals, and other operations, before final rendering.
Details
If options(na.detail.response=TRUE)
has been set and na.action is "na.delete" or
"na.keep", summary statistics on
the response variable are printed separately for missing and non-missing
values of each predictor. The default summary function returns
the number of non-missing response values and the mean of the last
column of the response values, with a names attribute of
c("N","Mean").
When the response is a Surv object and the mean is used, this will
result in the crude proportion of events being used to summarize
the response. The actual summary function can be designated through
options(na.fun.response = "function name").
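A function with the behavior described for the default summary could look like the base R sketch below. The name na_summary_sketch is hypothetical, and exactly what input a user-supplied function receives is an assumption here, so check the package source before relying on this shape.

```r
# Hypothetical summary function: count of non-missing values and mean of the
# last column of the response, named c("N", "Mean")
na_summary_sketch <- function(y) {
  y <- as.matrix(y)            # a plain vector becomes a one-column matrix
  last <- y[, ncol(y)]
  last <- last[!is.na(last)]
  c(N = length(last), Mean = mean(last))
}

# Two-column response resembling a survival outcome: time and event indicator
resp <- cbind(time = c(5, 8, 12, 3), status = c(1, 0, 1, 1))
out <- na_summary_sketch(resp)
print(out)
```

Because the last column of such a response is the event indicator, the mean is the crude proportion of events, as noted above.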
If you are modifying LaTeX parskip or certain other parameters,
you may need to shrink the area around tabular and
verbatim environments produced by latex.describe. You can
do this using for example
\usepackage{etoolbox}\makeatletter\preto{\@verbatim}{\topsep=-1.4pt
\partopsep=0pt}\preto{\@tabular}{\parskip=2pt
\parsep=0pt}\makeatother in the LaTeX preamble.
Multiple choice (mChoice) variables' describe output renders well in html but not when included in a Quarto document.
Author
Frank Harrell
Vanderbilt University
fh@fharrell.com
Examples
set.seed(1)
describe(runif(200),dig=2) #single variable, continuous
#> runif(200)
#> n missing distinct Info Mean pMedian Gmd .05
#> 200 0 200 1 0.52 0.52 0.31 0.084
#> .10 .25 .50 .75 .90 .95
#> 0.142 0.294 0.505 0.742 0.881 0.927
#>
#> lowest : 0.0130776 0.0133903 0.0233312 0.0355406 0.0589344
#> highest: 0.976171 0.985095 0.991839 0.991906 0.992684
#get quantiles .05,.10,...
dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
describe(dfr)
#> dfr
#>
#> 2 Variables 400 Observations
#> --------------------------------------------------------------------------------
#> x
#> n missing distinct Info Mean pMedian Gmd .05
#> 400 0 400 1 -0.0463 -0.04444 1.223 -1.79403
#> .10 .25 .50 .75 .90 .95
#> -1.43026 -0.82824 -0.01549 0.68933 1.34958 1.77431
#>
#> lowest : -2.99695 -2.93977 -2.59611 -2.51443 -2.44231
#> highest: 2.25188 2.32133 2.34949 2.67574 3.05574
#> --------------------------------------------------------------------------------
#> y
#> n missing distinct
#> 400 0 2
#>
#> Value female male
#> Frequency 213 187
#> Proportion 0.532 0.468
#> --------------------------------------------------------------------------------
if (FALSE) { # \dontrun{
options(grType='plotly')
d <- describe(mydata)
p <- plot(d) # create plots for both types of variables
p[[1]]; p[[2]] # or p$Categorical; p$Continuous
plotly::subplot(p[[1]], p[[2]], nrows=2) # plot both in one
plot(d, which='categorical') # categorical ones
d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
describe(d) #describe entire data frame
attach(d, 1)
describe(relig) #Has special missing values .D .F .M .R .T
#attr(relig,"label") is "Religious preference"
#relig : Religious preference Format:relig
# n missing D F M R T distinct
# 4038 263 45 33 7 2 1 8
#
#0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%)
#3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%)
#5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%)
# Method for describing part of a data frame:
describe(death.time ~ age*sex + rcs(blood.pressure))
describe(~ age+sex)
describe(~ age+sex, weights=freqs) # weighted analysis
fit <- lrm(y ~ age*sex + log(height))
describe(formula(fit))
describe(y ~ age*sex, na.action=na.delete)
# report on number deleted for each variable
options(na.detail.response=TRUE)
# keep missings separately for each x, report on dist of y by x=NA
describe(y ~ age*sex)
options(na.fun.response="quantile")
describe(y ~ age*sex) # same but use quantiles of y by x=NA
d <- describe(my.data.frame)
d$age # print description for just age
d[c('age','sex')] # print description for two variables
d[sort(names(d))] # print in alphabetic order by var. names
d2 <- d[20:30] # keep variables 20-30
page(d2) # pop-up window for these variables
# Test date/time formats and suppression of times when they don't vary
library(chron)
d <- data.frame(a=chron((1:20)+.1),
b=chron((1:20)+(1:20)/100),
d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
hour=1:20,min=1:20,sec=1:20),
g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
describe(d)
# Make a function to run describe, latex.describe, and use the kdvi
# previewer in Linux to view the result and easily make a pdf file
ldesc <- function(data) {
options(xdvicmd='kdvi')
d <- describe(data, desc=deparse(substitute(data)))
dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11)
}
ldesc(d)
} # }