Generate a data grid of user-specified values for use in the newdata argument of the predictions(), comparisons(), and slopes() functions. This is useful to define where in the predictor space we want to evaluate the quantities of interest. For example: the predicted outcome or slope for a 37-year-old college graduate.
Usage
datagrid(
...,
model = NULL,
newdata = NULL,
by = NULL,
grid_type = "mean_or_mode",
response = FALSE,
FUN = NULL,
FUN_character = NULL,
FUN_factor = NULL,
FUN_logical = NULL,
FUN_numeric = NULL,
FUN_integer = NULL,
FUN_binary = NULL,
FUN_other = NULL
)
Arguments
- ...
named arguments with vectors of values or functions for user-specified variables.
Functions are applied to the variable in the model dataset or newdata, and must return a vector of the appropriate type. Character vectors are automatically transformed to factors if necessary.
The output will include all combinations of these variables (see Examples below).
- model
Model object
- newdata
data.frame (one and only one of the model and newdata arguments can be used).
- by
character vector with grouping variables within which FUN_* functions are applied to create "sub-grids" with unspecified variables.
- grid_type
character. Determines the functions to apply to each variable. The defaults can be overridden by defining individual variables explicitly in ..., or by supplying a function to one of the FUN_* arguments.
"mean_or_mode": Character, factor, logical, and binary variables are set to their modes. Numeric, integer, and other variables are set to their means.
"balanced": Each unique level of character, factor, logical, and binary variables is preserved. Numeric, integer, and other variables are set to their means. Warning: When there are many variables and many levels per variable, a balanced grid can be very large. In those cases, it is better to use grid_type="mean_or_mode" and to specify the unique levels of a subset of named variables explicitly.
"dataframe": Similar to "mean_or_mode" but creates a data frame by binding columns element-wise rather than taking the cross-product. All explicitly specified vectors must have the same length (or length 1), and the result has as many rows as the longest vector. This differs from other grid types, which use expand.grid() or data.table::CJ() to create all combinations.
"counterfactual": The entire dataset is duplicated for each combination of the variable values specified in .... Variables not explicitly supplied to datagrid() are set to their observed values in the original dataset.
- response
Logical. Should the response variable be included in the grid, even if it is not specified explicitly?
- FUN
a function to be applied to all variables in the grid. This is useful when you want to apply the same function to all variables, such as mean or median. If you specify FUN, it will override the grid_type defaults, but not other FUN_* arguments below.
- FUN_character
the function to be applied to character variables.
- FUN_factor
the function to be applied to factor variables. This only applies if the variable in the original data is a factor. For variables converted to factor in a model-fitting formula, for example, FUN_character is used.
- FUN_logical
the function to be applied to logical variables.
- FUN_numeric
the function to be applied to numeric variables.
- FUN_integer
the function to be applied to integer-ish variables (including columns without decimal places).
- FUN_binary
the function to be applied to binary variables.
- FUN_other
the function to be applied to other variable types.
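The grid_type options described above differ mainly in how the user-supplied values are combined. The base R sketch below mimics the three shapes for illustration only. It is not datagrid()'s implementation, and the object names (cross, bound, counterfactual) are hypothetical.

```r
# Illustration only: base R analogues of the grid shapes, not datagrid() internals.
hp_values  <- c(100, 200)
cyl_values <- c(4, 6)

# Cross-product, as in "mean_or_mode" and "balanced": every combination.
cross <- expand.grid(hp = hp_values, cyl = cyl_values)

# Element-wise binding, as in "dataframe": vectors are paired row by row.
bound <- data.frame(hp = hp_values, cyl = cyl_values)

# "counterfactual": the whole dataset is repeated once per supplied value.
# merge() with by = NULL returns the Cartesian product of its inputs.
others <- mtcars[, setdiff(names(mtcars), "hp")]
counterfactual <- merge(others, data.frame(hp = c(100, 110)), by = NULL)

nrow(cross)           # 4 rows: 2 x 2 combinations
nrow(bound)           # 2 rows: one per pair
nrow(counterfactual)  # 64 rows: 32 observations x 2 values of hp
```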
Value
A data.frame in which each row corresponds to one combination of the named
predictors supplied by the user via the ... dots. Under the default
grid_type = "mean_or_mode", variables which are not explicitly defined are
held at their mean or mode.
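To make "held at their mean or mode" concrete, the sketch below computes such typical values by hand for a few mtcars columns. The stat_mode() helper is hypothetical and written only for this illustration. datagrid() may compute these summaries differently.

```r
# Hypothetical helper: most frequent value of a vector.
# Ties go to the value that appears first in the data.
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

typical <- data.frame(
  mpg = mean(mtcars$mpg),      # continuous variable: mean
  am  = stat_mode(mtcars$am),  # binary variable: mode
  vs  = stat_mode(mtcars$vs)   # binary variable: mode
)
typical
```

These values can be compared with the mpg, am, and vs columns of the first datagrid() call in the Examples section.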
Details
If datagrid() is used in a predictions(), comparisons(), or slopes() call as the newdata argument, the model is automatically inserted in the model argument of the datagrid() call, and users do not need to specify either the model or newdata arguments. The same behavior will occur when the value supplied to newdata= is a function call which starts with "datagrid". This is intended to allow users to create convenience shortcuts like:
mod <- lm(mpg ~ am + vs + factor(cyl) + hp, mtcars)
datagrid_bal <- function(...) datagrid(..., grid_type = "balanced")
predictions(mod, newdata = datagrid_bal(cyl = 4))
If users supply a model, the data used to fit that model is retrieved using the insight::get_data function.
Warning about hierarchical grouping variables: When using the default grid_type = "mean_or_mode" with hierarchical models (such as mixed models with nested grouping factors), datagrid() may create invalid combinations of grouping variables. For example, if you have students nested within schools, or countries nested within regions, the modal values of each grouping variable may not correspond to valid nested relationships in the data. This can cause prediction errors. To avoid this issue, explicitly specify valid combinations of hierarchical grouping variables in the datagrid() call, or use grid_type = "counterfactual" to preserve the original data structure.
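The warning about hierarchical grouping variables can be illustrated with a small base R sketch. The data, the stat_mode() helper, and all object names are hypothetical. The point is only that taking the mode of each grouping variable separately can produce a pair that never occurs in the data.

```r
# Hypothetical data: classrooms nested within schools.
dat <- data.frame(
  school    = c("A", "A", "A", "B", "B", "B", "B"),
  classroom = c("a1", "a1", "a1", "b1", "b1", "b2", "b2")
)

# Hypothetical helper: most frequent value (ties go to the first value seen).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Mode of each grouping variable, computed independently.
modal <- data.frame(
  school    = stat_mode(dat$school),
  classroom = stat_mode(dat$classroom)
)

# Combinations that actually occur in the data.
observed <- unique(dat)

# Does the modal pair correspond to an observed combination?
modal_is_valid <- nrow(merge(modal, observed)) > 0
modal_is_valid
```

Here the modal school is "B" and the modal classroom is "a1", but classroom "a1" belongs to school "A". The rows of observed are valid combinations that could be supplied to datagrid() explicitly.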
Examples
# The output only has 2 rows, and all the variables except `hp` are at their
# mean or mode.
datagrid(newdata = mtcars, hp = c(100, 110))
#> rowid mpg cyl disp drat wt qsec vs am gear carb hp
#> 1 1 20.09062 6 230.7219 3.596563 3.21725 17.84875 0 0 4 3 100
#> 2 2 20.09062 6 230.7219 3.596563 3.21725 17.84875 0 0 4 3 110
# We get the same `hp` values by feeding a model instead of a data.frame,
# but the grid only includes the variables used in the model
mod <- lm(mpg ~ hp, mtcars)
datagrid(model = mod, hp = c(100, 110))
#> rowid mpg hp
#> 1 1 20.09062 100
#> 2 2 20.09062 110
# Use in `marginaleffects` to compute "Typical Marginal Effects". When used
# in `slopes()` or `predictions()` we do not need to specify the
# `model` or `newdata` arguments.
slopes(mod, newdata = datagrid(hp = c(100, 110)))
#>
#> hp Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
#> 100 -0.0682 0.0101 -6.74 <0.001 35.9 -0.0881 -0.0484
#> 110 -0.0682 0.0101 -6.74 <0.001 35.9 -0.0881 -0.0484
#>
#> Term: hp
#> Type: response
#> Comparison: dY/dX
#>
# datagrid accepts functions
datagrid(hp = range, cyl = unique, newdata = mtcars)
#> rowid mpg disp drat wt qsec vs am gear carb hp cyl
#> 1 1 20.09062 230.7219 3.596563 3.21725 17.84875 0 0 4 3 52 4
#> 2 2 20.09062 230.7219 3.596563 3.21725 17.84875 0 0 4 3 52 6
#> 3 3 20.09062 230.7219 3.596563 3.21725 17.84875 0 0 4 3 52 8
#> 4 4 20.09062 230.7219 3.596563 3.21725 17.84875 0 0 4 3 335 4
#> 5 5 20.09062 230.7219 3.596563 3.21725 17.84875 0 0 4 3 335 6
#> 6 6 20.09062 230.7219 3.596563 3.21725 17.84875 0 0 4 3 335 8
comparisons(mod, newdata = datagrid(hp = fivenum))
#>
#> hp Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
#> 52 -0.0682 0.0101 -6.74 <0.001 35.9 -0.0881 -0.0484
#> 96 -0.0682 0.0101 -6.74 <0.001 35.9 -0.0881 -0.0484
#> 123 -0.0682 0.0101 -6.74 <0.001 35.9 -0.0881 -0.0484
#> 180 -0.0682 0.0101 -6.74 <0.001 35.9 -0.0881 -0.0484
#> 335 -0.0682 0.0101 -6.74 <0.001 35.9 -0.0881 -0.0484
#>
#> Term: hp
#> Type: response
#> Comparison: +1
#>
# The full dataset is duplicated with each observation given counterfactual
# values of 100 and 110 for the `hp` variable. The original `mtcars` includes
# 32 rows, so the resulting dataset includes 64 rows.
dg <- datagrid(newdata = mtcars, hp = c(100, 110), grid_type = "counterfactual")
nrow(dg)
#> [1] 64
# We get the same result by feeding a model instead of a data.frame
mod <- lm(mpg ~ hp, mtcars)
dg <- datagrid(model = mod, hp = c(100, 110), grid_type = "counterfactual")
nrow(dg)
#> [1] 64
# Use `by` to hold variables at group-specific values
mod2 <- lm(mpg ~ hp + cyl, mtcars)
datagrid(model = mod2, hp = mean, by = "cyl")
#> rowid cyl mpg hp
#> 1 1 4 26.66364 82.63636
#> 2 2 6 19.74286 122.28571
#> 3 3 8 15.10000 209.21429
# Use `FUN` to apply the same function to all variables
datagrid(model = mod2, FUN = median)
#> rowid cyl hp mpg
#> 1 1 6 123 19.2
# Use `grid_type="dataframe"` for column-wise binding instead of cross-product
datagrid(model = mod2, hp = c(100, 200), cyl = c(4, 6), grid_type = "dataframe")
#> rowid mpg hp cyl
#> 1 1 20.09062 100 4
#> 2 2 20.09062 200 6
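As a further illustration of the FUN_* arguments described in the Arguments section, the sketch below overrides the summary used for numeric variables while leaving the other defaults in place. It assumes the marginaleffects package is installed, and the object name grid_median is hypothetical.

```r
library(marginaleffects)

# Hold numeric variables at their median instead of their mean.
# Other variable types keep the grid_type = "mean_or_mode" defaults.
grid_median <- datagrid(newdata = mtcars, FUN_numeric = median)

# Compare the grid value with the median computed directly.
grid_median$mpg
median(mtcars$mpg)
```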