Summary of a data frame consisting of: variable names and types, labels if any, factor levels, frequencies and/or numerical summary statistics, barplots/histograms, and valid/missing observation counts and proportions.
Usage
dfSummary(
x,
round.digits = 1,
varnumbers = st_options("dfSummary.varnumbers"),
class = st_options("dfSummary.class"),
labels.col = st_options("dfSummary.labels.col"),
valid.col = st_options("dfSummary.valid.col"),
na.col = st_options("dfSummary.na.col"),
graph.col = st_options("dfSummary.graph.col"),
graph.magnif = st_options("dfSummary.graph.magnif"),
style = st_options("dfSummary.style"),
plain.ascii = st_options("plain.ascii"),
justify = "l",
na.val = st_options("na.val"),
col.widths = NA,
headings = st_options("headings"),
display.labels = st_options("display.labels"),
max.distinct.values = 10,
trim.strings = FALSE,
max.string.width = 25,
split.cells = 40,
split.tables = Inf,
tmp.img.dir = st_options("tmp.img.dir"),
keep.grp.vars = FALSE,
silent = st_options("dfSummary.silent"),
...
)
Arguments
- x
A data frame.
- round.digits
Number of significant digits to display. Defaults to 1. Does not affect proportions, which always show 1 digit.
- varnumbers
Logical. Show variable numbers in the first column. Defaults to TRUE. Can be set globally with st_options, option “dfSummary.varnumbers”.
- class
Logical. Show data classes in the Variable column. TRUE by default.
- labels.col
Logical. If TRUE, variable labels (as defined with rapportools, Hmisc or summarytools' label functions, among others) will be displayed. TRUE by default, but the labels column is only shown if a label exists for at least one column. Can be set globally with st_options, option “dfSummary.labels.col”.
- valid.col
Logical. Include column indicating count and proportion of valid (non-missing) values. TRUE by default; can be set globally with st_options, option “dfSummary.valid.col”.
- na.col
Logical. Include column indicating count and proportion of missing (NA) values. TRUE by default; can be set globally with st_options, option “dfSummary.na.col”.
- graph.col
Logical. Display barplots/histograms column. TRUE by default; can be set globally with st_options, option “dfSummary.graph.col”.
- graph.magnif
Numeric. Magnification factor for graphs column. Useful if the graphs show up too large (then use a value such as 0.75) or too small (use a value such as 1.25). Must be positive. Defaults to 1. Can be set globally with st_options, option “dfSummary.graph.magnif”.
- style
Character. Argument used by pander. Defaults to “multiline”. The only other valid option is “grid”. Style “rmarkdown” will fall back to “multiline”.
- plain.ascii
Logical. pander argument; when TRUE, no markup characters will be used (useful when printing to console). Defaults to TRUE. Set to FALSE when in context of markdown rendering. To change the default value globally, see st_options.
- justify
String indicating alignment of columns; one of “l” (left), “c” (center), or “r” (right). Defaults to “l”.
- na.val
Character. For factors and character vectors, consider this value as NA. Ignored if there are actual NA values. NULL by default.
- col.widths
Numeric or character. Vector of column widths. If numeric, values are assumed to be numbers of pixels. Otherwise, any CSS-supported units can be used. NA by default, meaning widths are calculated automatically.
- headings
Logical. Set to FALSE to omit headings. To change this default value globally, see st_options.
- display.labels
Logical. Should the data frame label be displayed in the title section? Default is TRUE. To change this default value globally, see st_options.
- max.distinct.values
The maximum number of values to display frequencies for. If variable has more distinct values than this number, the remaining frequencies will be reported as a whole, along with the number of additional distinct values. Defaults to 10.
- trim.strings
Logical; for character variables, should leading and trailing white space be removed? Defaults to FALSE. See Details section.
- max.string.width
Limits the number of characters to display in the frequency tables. Defaults to 25.
- split.cells
A numeric argument passed to pander. It is the number of characters allowed on a line before splitting the cell. Defaults to 40.
- split.tables
pander argument which determines the maximum width of a table. Keeping the default value (Inf) is recommended.
- tmp.img.dir
Character. Directory used to store temporary images when rendering dfSummary() with `method = "pander"`, `plain.ascii = TRUE` and `style = "grid"`. See Details.
- keep.grp.vars
Logical. When using group_by, keep rows corresponding to grouping variable(s) in output table. When FALSE (default), variable numbers still reflect the ordering in the full data frame (in other words, some numbers will be skipped in the variable number column).
- silent
Logical. Hide console messages. FALSE by default. To change this value globally, see st_options.
- ...
Additional arguments passed to pander.
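The truncation described under max.distinct.values can be approximated in base R. The following is only a sketch of the idea, not the package's implementation; the helper name lump_freqs is hypothetical. It keeps the most frequent values up to the limit and reports the rest as a single count, labelled with the number of additional distinct values.

```r
# Hypothetical helper (not part of summarytools): approximate how a
# frequency listing is cut off once a distinct-value limit is exceeded.
lump_freqs <- function(x, max.distinct.values = 10) {
  tab <- table(x)  # NA values are excluded by default
  counts <- setNames(as.vector(tab), names(tab))
  counts <- sort(counts, decreasing = TRUE)  # most common values first
  if (length(counts) <= max.distinct.values) {
    return(counts)
  }
  shown  <- counts[seq_len(max.distinct.values)]
  hidden <- counts[-seq_len(max.distinct.values)]
  # Report the remaining values as a whole, with their distinct count
  c(shown, setNames(sum(hidden), sprintf("[ %d others ]", length(hidden))))
}

x <- c(rep("a", 5), rep("b", 4), rep("c", 3), "d", "e")
lump_freqs(x, max.distinct.values = 3)
```

With a limit of 3, the values "d" and "e" are merged into one "[ 2 others ]" entry, which mirrors the "[ 8 others ]" row in the max.distinct.values = 5 example further down.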
Value
A data frame with additional class summarytools containing as
many rows as there are columns in x, with attributes to inform
the print method. Columns in the output data frame are:
- No
Number indicating the order in which column appears in the data frame.
- Variable
Name of the variable, along with its class(es).
- Label
Label of the variable (if applicable).
- Stats / Values
For factors, a list of their values, limited by the max.distinct.values parameter. For character variables, the most common values (in descending frequency order), also limited by max.distinct.values. For numerical variables, common univariate statistics (mean, std. deviation, min, med, max, IQR and CV).
- Freqs (% of Valid)
For factors and character variables, the frequencies and proportions of the values listed in the previous column. For numerical vectors, number of distinct values, or frequency of distinct values if their number is not greater than max.distinct.values.
- Text Graph
An ASCII histogram for numerical variables, and ASCII barplot for factors and character variables.
- Graph
An HTML-encoded graph, either barplot or histogram.
- Valid
Number and proportion of valid values.
- Missing
Number and proportion of missing (NA and NaN) values.
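As a rough illustration of what the Valid and Missing columns report, the counts can be reproduced for a single vector in base R. This is a sketch on made-up data, and the formatting only imitates the printed output; it is not taken from the package source.

```r
# Toy numeric vector containing both NA and NaN
x <- c(1.5, NA, 3.2, NaN, 4.8)

n_missing <- sum(is.na(x))          # is.na() is TRUE for both NA and NaN
n_valid   <- length(x) - n_missing

# Count followed by proportion, similar to the Valid / Missing columns
valid_txt   <- sprintf("%d (%.1f%%)", n_valid,   100 * n_valid / length(x))
missing_txt <- sprintf("%d (%.1f%%)", n_missing, 100 * n_missing / length(x))
valid_txt
missing_txt
```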
Details
The default value plain.ascii = TRUE is intended to
facilitate interactive data exploration. When using the package for
reporting with rmarkdown, make sure to set this option to
FALSE.
When trim.strings is set to TRUE, trimming is done
before calculating frequencies, so be aware that those will
be impacted accordingly.
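The effect of trimming on frequencies can be seen with base R alone. The sketch below uses trimws() and table() on made-up data to stand in for what happens when values differ only by surrounding white space; it does not call dfSummary() itself.

```r
# Values that differ only by leading or trailing white space
x <- c("Yes", "Yes ", " Yes", "No", "No ")

untrimmed <- table(x)          # five distinct values, each counted once
trimmed   <- table(trimws(x))  # two distinct values after trimming

untrimmed
trimmed
```

Without trimming, each variant is counted as its own category; after trimming, the variants collapse into "Yes" and "No".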
Specifying tmp.img.dir allows producing results consistent with
pandoc styling while also showing png graphs. Due to the fact that
in Pandoc, column widths are determined by the length of cell contents
even if said content is merely a link to an image, using standard
R temporary directory to store the images would cause columns to be
exceedingly wide. A shorter path is needed. On Mac OS and Linux,
using “/tmp” is a sensible choice, since this directory is cleaned
up automatically on a regular basis. On Windows however, there is no such
convenient directory, so the user has to choose a directory and clean up the
temporary images manually after the document has been rendered. Providing
a relative path such as “img”, omitting “./”, is recommended.
The maximum length for this parameter is set to 5 characters. It can be set
globally with st_options (e.g.,
st_options(tmp.img.dir = ".")).
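A candidate directory can be checked against the 5-character limit before it is used. The helper below is hypothetical and assumes the limit is counted in characters as nchar() reports them; it only illustrates why the session temporary directory is unsuitable while a short relative path is fine.

```r
# Hypothetical helper (not part of summarytools): does the path fit the
# 5-character limit stated for tmp.img.dir?
short_enough <- function(path, max_chars = 5) {
  nchar(path) <= max_chars
}

short_enough("img")      # short relative path
short_enough("/tmp")     # common choice on Mac OS and Linux
short_enough(tempdir())  # session temp paths are typically much longer
```

A path that passes the check can then be supplied through the tmp.img.dir argument or set globally with st_options(tmp.img.dir = "img").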
It is possible to control which statistics are shown in the
Stats / Values column. For this, see the Details and
Examples sections of st_options.
Note
Several packages provide functions for defining variable labels, summarytools being one of them. Some packages (Hmisc in particular) employ special classes for labelled objects, but summarytools neither uses nor looks for any such classes.
Author
Dominic Comtois, dominic.comtois@gmail.com
Examples
data("tobacco")
saved_x11_option <- st_options("use.x11")
st_options(use.x11 = FALSE)
dfSummary(tobacco)
#> Data Frame Summary
#> tobacco
#> Dimensions: 1000 x 9
#> Duplicates: 2
#>
#> --------------------------------------------------------------------------------------------------------------
#> No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
#> ---- -------------- ------------------------- --------------------- --------------------- ---------- ---------
#> 1 gender 1. F 489 (50.0%) IIIIIIIIII 978 22
#> [factor] 2. M 489 (50.0%) IIIIIIIIII (97.8%) (2.2%)
#>
#> 2 age Mean (sd) : 49.6 (18.3) 63 distinct values . . . . . : 975 25
#> [numeric] min < med < max: : : : : : . : : : : (97.5%) (2.5%)
#> 18 < 50 < 80 : : : : : : : : : :
#> IQR (CV) : 32 (0.4) : : : : : : : : : :
#> : : : : : : : : : :
#>
#> 3 age.gr 1. 18-34 258 (26.5%) IIIII 975 25
#> [factor] 2. 35-50 241 (24.7%) IIII (97.5%) (2.5%)
#> 3. 51-70 317 (32.5%) IIIIII
#> 4. 71 + 159 (16.3%) III
#>
#> 4 BMI Mean (sd) : 25.7 (4.5) 974 distinct values : 974 26
#> [numeric] min < med < max: : : : (97.4%) (2.6%)
#> 8.8 < 25.6 < 39.4 : : :
#> IQR (CV) : 5.7 (0.2) : : : : :
#> . : : : : : .
#>
#> 5 smoker 1. Yes 298 (29.8%) IIIII 1000 0
#> [factor] 2. No 702 (70.2%) IIIIIIIIIIIIII (100.0%) (0.0%)
#>
#> 6 cigs.per.day Mean (sd) : 6.8 (11.9) 37 distinct values : 965 35
#> [numeric] min < med < max: : (96.5%) (3.5%)
#> 0 < 0 < 40 :
#> IQR (CV) : 11 (1.8) :
#> : . . . . . .
#>
#> 7 diseased 1. Yes 224 (22.4%) IIII 1000 0
#> [factor] 2. No 776 (77.6%) IIIIIIIIIIIIIII (100.0%) (0.0%)
#>
#> 8 disease 1. Hypertension 36 (16.2%) III 222 778
#> [character] 2. Cancer 34 (15.3%) III (22.2%) (77.8%)
#> 3. Cholesterol 21 ( 9.5%) I
#> 4. Heart 20 ( 9.0%) I
#> 5. Pulmonary 20 ( 9.0%) I
#> 6. Musculoskeletal 19 ( 8.6%) I
#> 7. Diabetes 14 ( 6.3%) I
#> 8. Hearing 14 ( 6.3%) I
#> 9. Digestive 12 ( 5.4%) I
#> 10. Hypotension 11 ( 5.0%)
#> [ 3 others ] 21 ( 9.5%) I
#>
#> 9 samp.wgts Mean (sd) : 1 (0.1) 0.86!: 267 (26.7%) IIIII 1000 0
#> [numeric] min < med < max: 1.04!: 249 (24.9%) IIII (100.0%) (0.0%)
#> 0.9 < 1 < 1.1 1.05!: 324 (32.4%) IIIIII
#> IQR (CV) : 0.2 (0.1) 1.06!: 160 (16.0%) III
#> ! rounded
#> --------------------------------------------------------------------------------------------------------------
# Exclude some of the columns to reduce table width
dfSummary(tobacco, varnumbers = FALSE, valid.col = FALSE)
#> Data Frame Summary
#> tobacco
#> Dimensions: 1000 x 9
#> Duplicates: 2
#>
#> ----------------------------------------------------------------------------------------------
#> Variable Stats / Values Freqs (% of Valid) Graph Missing
#> -------------- ------------------------- --------------------- --------------------- ---------
#> gender 1. F 489 (50.0%) IIIIIIIIII 22
#> [factor] 2. M 489 (50.0%) IIIIIIIIII (2.2%)
#>
#> age Mean (sd) : 49.6 (18.3) 63 distinct values . . . . . : 25
#> [numeric] min < med < max: : : : : : . : : : : (2.5%)
#> 18 < 50 < 80 : : : : : : : : : :
#> IQR (CV) : 32 (0.4) : : : : : : : : : :
#> : : : : : : : : : :
#>
#> age.gr 1. 18-34 258 (26.5%) IIIII 25
#> [factor] 2. 35-50 241 (24.7%) IIII (2.5%)
#> 3. 51-70 317 (32.5%) IIIIII
#> 4. 71 + 159 (16.3%) III
#>
#> BMI Mean (sd) : 25.7 (4.5) 974 distinct values : 26
#> [numeric] min < med < max: : : : (2.6%)
#> 8.8 < 25.6 < 39.4 : : :
#> IQR (CV) : 5.7 (0.2) : : : : :
#> . : : : : : .
#>
#> smoker 1. Yes 298 (29.8%) IIIII 0
#> [factor] 2. No 702 (70.2%) IIIIIIIIIIIIII (0.0%)
#>
#> cigs.per.day Mean (sd) : 6.8 (11.9) 37 distinct values : 35
#> [numeric] min < med < max: : (3.5%)
#> 0 < 0 < 40 :
#> IQR (CV) : 11 (1.8) :
#> : . . . . . .
#>
#> diseased 1. Yes 224 (22.4%) IIII 0
#> [factor] 2. No 776 (77.6%) IIIIIIIIIIIIIII (0.0%)
#>
#> disease 1. Hypertension 36 (16.2%) III 778
#> [character] 2. Cancer 34 (15.3%) III (77.8%)
#> 3. Cholesterol 21 ( 9.5%) I
#> 4. Heart 20 ( 9.0%) I
#> 5. Pulmonary 20 ( 9.0%) I
#> 6. Musculoskeletal 19 ( 8.6%) I
#> 7. Diabetes 14 ( 6.3%) I
#> 8. Hearing 14 ( 6.3%) I
#> 9. Digestive 12 ( 5.4%) I
#> 10. Hypotension 11 ( 5.0%)
#> [ 3 others ] 21 ( 9.5%) I
#>
#> samp.wgts Mean (sd) : 1 (0.1) 0.86!: 267 (26.7%) IIIII 0
#> [numeric] min < med < max: 1.04!: 249 (24.9%) IIII (0.0%)
#> 0.9 < 1 < 1.1 1.05!: 324 (32.4%) IIIIII
#> IQR (CV) : 0.2 (0.1) 1.06!: 160 (16.0%) III
#> ! rounded
#> ----------------------------------------------------------------------------------------------
# Limit number of categories to be displayed for categorical data
dfSummary(tobacco, max.distinct.values = 5, style = "grid")
#> Data Frame Summary
#> tobacco
#> Dimensions: 1000 x 9
#> Duplicates: 2
#>
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
#> | No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
#> +====+==============+=========================+=====================+=====================+==========+=========+
#> | 1 | gender | 1. F | 489 (50.0%) | IIIIIIIIII | 978 | 22 |
#> | | [factor] | 2. M | 489 (50.0%) | IIIIIIIIII | (97.8%) | (2.2%) |
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
#> | 2 | age | Mean (sd) : 49.6 (18.3) | 63 distinct values | . . . . . : | 975 | 25 |
#> | | [numeric] | min < med < max: | | : : : : : . : : : : | (97.5%) | (2.5%) |
#> | | | 18 < 50 < 80 | | : : : : : : : : : : | | |
#> | | | IQR (CV) : 32 (0.4) | | : : : : : : : : : : | | |
#> | | | | | : : : : : : : : : : | | |
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
#> | 3 | age.gr | 1. 18-34 | 258 (26.5%) | IIIII | 975 | 25 |
#> | | [factor] | 2. 35-50 | 241 (24.7%) | IIII | (97.5%) | (2.5%) |
#> | | | 3. 51-70 | 317 (32.5%) | IIIIII | | |
#> | | | 4. 71 + | 159 (16.3%) | III | | |
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
#> | 4 | BMI | Mean (sd) : 25.7 (4.5) | 974 distinct values | : | 974 | 26 |
#> | | [numeric] | min < med < max: | | : : : | (97.4%) | (2.6%) |
#> | | | 8.8 < 25.6 < 39.4 | | : : : | | |
#> | | | IQR (CV) : 5.7 (0.2) | | : : : : : | | |
#> | | | | | . : : : : : . | | |
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
#> | 5 | smoker | 1. Yes | 298 (29.8%) | IIIII | 1000 | 0 |
#> | | [factor] | 2. No | 702 (70.2%) | IIIIIIIIIIIIII | (100.0%) | (0.0%) |
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
#> | 6 | cigs.per.day | Mean (sd) : 6.8 (11.9) | 37 distinct values | : | 965 | 35 |
#> | | [numeric] | min < med < max: | | : | (96.5%) | (3.5%) |
#> | | | 0 < 0 < 40 | | : | | |
#> | | | IQR (CV) : 11 (1.8) | | : | | |
#> | | | | | : . . . . . . | | |
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
#> | 7 | diseased | 1. Yes | 224 (22.4%) | IIII | 1000 | 0 |
#> | | [factor] | 2. No | 776 (77.6%) | IIIIIIIIIIIIIII | (100.0%) | (0.0%) |
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
#> | 8 | disease | 1. Hypertension | 36 (16.2%) | III | 222 | 778 |
#> | | [character] | 2. Cancer | 34 (15.3%) | III | (22.2%) | (77.8%) |
#> | | | 3. Cholesterol | 21 ( 9.5%) | I | | |
#> | | | 4. Heart | 20 ( 9.0%) | I | | |
#> | | | 5. Pulmonary | 20 ( 9.0%) | I | | |
#> | | | [ 8 others ] | 91 (41.0%) | IIIIIIII | | |
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
#> | 9 | samp.wgts | Mean (sd) : 1 (0.1) | 0.86!: 267 (26.7%) | IIIII | 1000 | 0 |
#> | | [numeric] | min < med < max: | 1.04!: 249 (24.9%) | IIII | (100.0%) | (0.0%) |
#> | | | 0.9 < 1 < 1.1 | 1.05!: 324 (32.4%) | IIIIII | | |
#> | | | IQR (CV) : 0.2 (0.1) | 1.06!: 160 (16.0%) | III | | |
#> | | | | ! rounded | | | |
#> +----+--------------+-------------------------+---------------------+---------------------+----------+---------+
# Using stby()
stby(tobacco, tobacco$gender, dfSummary)
#> NA detected in grouping variable(s); consider using useNA = TRUE
#> Data Frame Summary
#> tobacco
#> Group: gender = F
#> Dimensions: 489 x 9
#> Duplicates: 0
#>
#> --------------------------------------------------------------------------------------------------------------
#> No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
#> ---- -------------- ------------------------- --------------------- --------------------- ---------- ---------
#> 2 age Mean (sd) : 49.6 (18.3) 63 distinct values : . . . : . : 475 14
#> [numeric] min < med < max: : : : : : : : : : : (97.1%) (2.9%)
#> 18 < 50 < 80 : : : : : : : : : :
#> IQR (CV) : 32 (0.4) : : : : : : : : : :
#> : : : : : : : : : :
#>
#> 3 age.gr 1. 18-34 123 (25.9%) IIIII 475 14
#> [factor] 2. 35-50 118 (24.8%) IIII (97.1%) (2.9%)
#> 3. 51-70 157 (33.1%) IIIIII
#> 4. 71 + 77 (16.2%) III
#>
#> 4 BMI Mean (sd) : 26.1 (4.9) 475 distinct values : 475 14
#> [numeric] min < med < max: : : (97.1%) (2.9%)
#> 9 < 25.9 < 39.4 : : .
#> IQR (CV) : 6.5 (0.2) . : : :
#> : : : : .
#>
#> 5 smoker 1. Yes 147 (30.1%) IIIIII 489 0
#> [factor] 2. No 342 (69.9%) IIIIIIIIIIIII (100.0%) (0.0%)
#>
#> 6 cigs.per.day Mean (sd) : 6.9 (12) 37 distinct values : 468 21
#> [numeric] min < med < max: : (95.7%) (4.3%)
#> 0 < 0 < 40 :
#> IQR (CV) : 10.2 (1.8) :
#> : . . . . . .
#>
#> 7 diseased 1. Yes 111 (22.7%) IIII 489 0
#> [factor] 2. No 378 (77.3%) IIIIIIIIIIIIIII (100.0%) (0.0%)
#>
#> 8 disease 1. Hypertension 18 (16.5%) III 109 380
#> [character] 2. Cancer 16 (14.7%) II (22.3%) (77.7%)
#> 3. Cholesterol 10 ( 9.2%) I
#> 4. Heart 9 ( 8.3%) I
#> 5. Pulmonary 9 ( 8.3%) I
#> 6. Diabetes 8 ( 7.3%) I
#> 7. Musculoskeletal 8 ( 7.3%) I
#> 8. Hypotension 7 ( 6.4%) I
#> 9. Neurological 7 ( 6.4%) I
#> 10. Vision 6 ( 5.5%) I
#> [ 3 others ] 11 (10.1%) II
#>
#> 9 samp.wgts Mean (sd) : 1 (0.1) 0.86!: 131 (26.8%) IIIII 489 0
#> [numeric] min < med < max: 1.04!: 120 (24.5%) IIII (100.0%) (0.0%)
#> 0.9 < 1 < 1.1 1.05!: 160 (32.7%) IIIIII
#> IQR (CV) : 0.2 (0.1) 1.06!: 78 (16.0%) III
#> ! rounded
#> --------------------------------------------------------------------------------------------------------------
#>
#> Group: gender = M
#> Dimensions: 489 x 9
#> Duplicates: 2
#>
#> --------------------------------------------------------------------------------------------------------------
#> No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
#> ---- -------------- ------------------------- --------------------- --------------------- ---------- ---------
#> 2 age Mean (sd) : 49.6 (18.3) 63 distinct values . . . : 478 11
#> [numeric] min < med < max: : : : : : : : : : (97.8%) (2.2%)
#> 18 < 49.5 < 80 : : : : : : : : : :
#> IQR (CV) : 32 (0.4) : : : : : : : : : :
#> : : : : : : : : : :
#>
#> 3 age.gr 1. 18-34 130 (27.2%) IIIII 478 11
#> [factor] 2. 35-50 118 (24.7%) IIII (97.8%) (2.2%)
#> 3. 51-70 151 (31.6%) IIIIII
#> 4. 71 + 79 (16.5%) III
#>
#> 4 BMI Mean (sd) : 25.3 (4) 477 distinct values : . 477 12
#> [numeric] min < med < max: : : (97.5%) (2.5%)
#> 8.8 < 25.1 < 36.8 : : : .
#> IQR (CV) : 5.4 (0.2) : : : : .
#> : : : : : :
#>
#> 5 smoker 1. Yes 143 (29.2%) IIIII 489 0
#> [factor] 2. No 346 (70.8%) IIIIIIIIIIIIII (100.0%) (0.0%)
#>
#> 6 cigs.per.day Mean (sd) : 6.7 (11.8) 36 distinct values : 475 14
#> [numeric] min < med < max: : (97.1%) (2.9%)
#> 0 < 0 < 40 :
#> IQR (CV) : 11 (1.8) :
#> : . . . . .
#>
#> 7 diseased 1. Yes 110 (22.5%) IIII 489 0
#> [factor] 2. No 379 (77.5%) IIIIIIIIIIIIIII (100.0%) (0.0%)
#>
#> 8 disease 1. Cancer 18 (16.4%) III 110 379
#> [character] 2. Hypertension 17 (15.5%) III (22.5%) (77.5%)
#> 3. Cholesterol 11 (10.0%) II
#> 4. Heart 11 (10.0%) II
#> 5. Pulmonary 11 (10.0%) II
#> 6. Musculoskeletal 10 ( 9.1%) I
#> 7. Hearing 9 ( 8.2%) I
#> 8. Digestive 7 ( 6.4%) I
#> 9. Diabetes 5 ( 4.5%)
#> 10. Hypotension 4 ( 3.6%)
#> [ 3 others ] 7 ( 6.4%) I
#>
#> 9 samp.wgts Mean (sd) : 1 (0.1) 0.86!: 131 (26.8%) IIIII 489 0
#> [numeric] min < med < max: 1.04!: 124 (25.4%) IIIII (100.0%) (0.0%)
#> 0.9 < 1 < 1.1 1.05!: 155 (31.7%) IIIIII
#> IQR (CV) : 0.2 (0.1) 1.06!: 79 (16.2%) III
#> ! rounded
#> --------------------------------------------------------------------------------------------------------------
st_options(use.x11 = saved_x11_option)
if (FALSE) { # \dontrun{
# Show in Viewer or browser (no capital V in view(); stview() is also
# available in case of conflicts with other packages)
view(dfSummary(iris))
# Rmarkdown-ready
dfSummary(tobacco, style = "grid", plain.ascii = FALSE,
          varnumbers = FALSE, valid.col = FALSE, tmp.img.dir = "./img")
# Using group_by() (requires dplyr for group_by() and %>%)
library(dplyr)
tobacco %>% group_by(gender) %>% dfSummary()
} # }