Generate correlated data for simulations (third edition)
genCorrelatedData3.RdThis is a revision of genCorrelatedData2. The output is a
data frame that has columns for the predictors along with an error
term, the linear predictor, and the observed value of the outcome
variable. The new features are in the user interface. It has a
better way to specify beta coefficients. It is also more flexible
in the specification of the names of the predictor columns.
Arguments
- formula
a text variable, e.g.,
"y ~ 1 + 2*x1". Use ":" to create squared and interaction terms,"y ~ 1 + 2*x1 + 1.1*x1:x1 + 0.2*x1:x2".Multi-way interactions are allowed, eg"y ~ 1 + 2*x1 + .4*x2 + .1*x3 + 1.1*x1:x1 + 0.2*x1:x2:x3".Note author can specify any order of interation.- N
sample size
- means
averages of predictors, can include names c(x1 = 10, x2 = 20) that will be used in the data.frame result.
- sds
standard deviations, 1 (common value for all variables) or as many elements as in
means.- rho
correlations, can be 1 or a vech for a correlation matrix
- stde
The scale of the error term. If
distrib=rnorm, stde is the standard deviation of the error term. If the user changes the distribution, this is a scale parameter that may not be equal to the standard deviation. For example,distrib=rlogisthas a scale parameter such that a value of stde implies the error's standard deviation will be \(stde * pi / sqrt(3)\).- beta
slope coefficients, use either this or
formula, not both. It is easier (less error prone) to use named coefficients, but (for backwards compatability withgenCorrelatedData2) names are not required. If named, use "Intercept" for the intercept coefficient, and use variable names that match thexmeansvector. Un-named coefficients follow same rules as ingenCorrelatedData2. The first (1 + p) values are for the intercept and p main effects. With 3 predictors and no squares or interactions, specify four betas corresponding toc(Intercept, x1, x2, x3). The squared and interaction terms may follow. The largest possible model would correspond toc(Intercept, x1, x2, x3, x1:x1, x1:x2, x1:x3, x2:x2, x2:x3, x3:x3). Squares and interactions fill in a "lower triangle". The unnamed beta vector can be terminated with the last non-zero coefficient, the function will insert 0's for the coefficients at the end of the vector.- intercept
TRUE or FALSE. Should the output data set include a column of 1's. If beta is an unnamed vector, should the first element be treated as an intercept?
- col.names
Can override names in means vector
- verbose
TRUE for diagnostics
- ...
extra arguments, ignored for now. We use that to ignore unrecognized parameters.
- distrib
An R random data generating function. Default is
rnorm. Alsorlogisor any other two-parameter location/scale distribution will work. Special configuration allowsrt. See details.
Details
The enhanced methods for authors to specify a data-generating
process are as follows. Either way will work and the choice
between the methods is driven by the author's convenience.
1. Use the formula argument as a quoted string:
"1 + 2.2 * x1 + 2.3 * x2 + 3.3 * x3 + 1.9 * x1:x2". The "*" represents multiplication of coefficient times variable, and the colon ":" has same meaning but it is used for products of variables.2. Use the beta argument with parameter names,
beta = c("Intercept" = 1, x1 = 2.2, x2 = 2.3, x3 = 3.3, "x1:x2" = 1.9)where the names are the same as the names of the variables in the formula. Names of the variables in the formula or the beta vector should be used also in either the means parameter or the col.names parameter.
The error distribution can be specified. Default is normal, with
draws provided by R's rnorm. All error models assume
\(E[e] = 0\) and the scale coefficient is the parameter
stde. Thus, the default setup's error will be drawn from
rnorm(N, 0, stde). Any two parameter "location" and "scale"
distribution should work as well, as long as the first coefficient
is location (because we set that as 0 in all cases) and the second
argument is scale. For example, distrib=rlogis, will lead
to errors drawn from rlogis(N, 0, stde). Caution: in rlogis,
the scale parameter is not the same as standard deviation.
The only one parameter distribution currently supported is the T
distribution. If user specifies distrib=rt, then the
stde is passed through to the parameter df. Note
that if increasing the stde parameter will cause the standard
deviation of rt to get smaller. df=1 implies sd =
794.6; df=2 implies sd = 3.27; df=3 implies 1.7773.
Methods to specify error distributions in a more flexible way need to be considered.
Author
Paul Johnson pauljohn@ku.edu and Gabor Grothendieck <ggrothendieck@gmail.com>
Examples
set.seed(123123)
## note: x4 is an unused variable in formula
X1a <-
genCorrelatedData3("y ~ 1.1 + 2.1 * x1 + 3 * x2 + 3.5 * x3 + 1.1 * x1:x3",
N = 1000, means = c(x1 = 1, x2 = -1, x3 = 3, x4 = 1),
sds = 1, rho = 0.4, stde = 5)
lm1a <- lm(y ~ x1 + x2 + x3 + x1:x3, data = X1a)
## note that normal errors have std.error. close to 5
summary(lm1a)
#>
#> Call:
#> lm(formula = y ~ x1 + x2 + x3 + x1:x3, data = X1a)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17.1237 -3.4118 -0.0516 3.5351 14.0045
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.3462 0.7499 1.795 0.0729 .
#> x1 2.3052 0.4880 4.724 2.65e-06 ***
#> x2 3.4963 0.1817 19.241 < 2e-16 ***
#> x3 3.5691 0.2404 14.847 < 2e-16 ***
#> x1:x3 0.9988 0.1546 6.461 1.63e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 5.022 on 995 degrees of freedom
#> Multiple R-squared: 0.8057, Adjusted R-squared: 0.8049
#> F-statistic: 1031 on 4 and 995 DF, p-value: < 2.2e-16
#>
attr(X1a, "beta")
#> x1 x2 x3 x1:x3
#> 1.1 2.1 3.0 3.5 1.1
attr(X1a, "formula")
#> [1] "1.1*1 + 2.1*x1 + 3*x2 + 3.5*x3 + 1.1*x1*x3"
## Demonstrate name beta vector method to provide named arguments
set.seed(123123)
X2 <- genCorrelatedData3(N = 1000, means = c(x1 = 1, x2 = -1, x3 = 3, x4 = 1),
sds = 1, rho = 0.4,
beta = c("Intercept" = 1.1, x1 = 2.1, x2 = 3,
x3 = 3.5, "x1:x3" = 1.1),
intercept = TRUE, stde = 5)
attr(X2, c("beta"))
#> Intercept x1 x2 x3 x1:x3
#> 1.1 2.1 3.0 3.5 1.1
attr(X2, c("formula"))
#> [1] "1.1*1 + 2.1*x1 + 3*x2 + 3.5*x3 + 1.1*x1*x3"
head(X2)
#> Intercept x1 x2 x3 x4 error eta
#> 1 1 0.6928368 -1.0590339 3.0087527 0.7000470 -5.5049728 12.201522
#> 2 1 0.9499863 -0.5197346 0.8072111 0.8623604 9.4421276 5.204530
#> 3 1 1.7725536 0.2809647 3.1467152 0.9272930 0.4504675 22.814254
#> 4 1 0.3854582 -1.8219058 2.5092442 0.2722184 1.9986776 6.290029
#> 5 1 0.8729748 -1.0363895 4.3534449 1.6607136 13.8316154 19.241628
#> 6 1 1.5960719 -0.9723073 2.7087208 0.4267363 0.2805470 15.770996
#> y
#> 1 6.696549
#> 2 14.646657
#> 3 23.264721
#> 4 8.288707
#> 5 33.073244
#> 6 16.051543
lm2 <- lm(y ~ x1 + x2 + x3 + x1:x3, data = X2)
summary(lm2)
#>
#> Call:
#> lm(formula = y ~ x1 + x2 + x3 + x1:x3, data = X2)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17.1237 -3.4118 -0.0516 3.5351 14.0045
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.3462 0.7499 1.795 0.0729 .
#> x1 2.3052 0.4880 4.724 2.65e-06 ***
#> x2 3.4963 0.1817 19.241 < 2e-16 ***
#> x3 3.5691 0.2404 14.847 < 2e-16 ***
#> x1:x3 0.9988 0.1546 6.461 1.63e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 5.022 on 995 degrees of freedom
#> Multiple R-squared: 0.8057, Adjusted R-squared: 0.8049
#> F-statistic: 1031 on 4 and 995 DF, p-value: < 2.2e-16
#>
## Equivalent with unnamed beta vector. Must carefully count empty
## spots, fill in 0's when coefficient is not present. This
## method was in genCorrelated2. Order of coefficents is
## c(intercept, x1, ..., xp, x1:x1, x1:x2, x1:xp, x2:x2, x2:x3, ..., )
## filling in a lower triangle.
set.seed(123123)
X3 <- genCorrelatedData3(N = 1000, means = c(x1 = 1, x2 = -1, x3 = 3, x4 = 1),
sds = 1, rho = 0.4,
beta = c(1.1, 2.1, 3, 3.5, 0, 0, 0, 1.1),
intercept = TRUE, stde = 5)
attr(X3, c("beta"))
#> Intercept x1 x2 x3 x4 x1:x1 x2:x1 x3:x1
#> 1.1 2.1 3.0 3.5 0.0 0.0 0.0 1.1
#> x4:x1 x2:x2 x3:x2 x4:x2 x3:x3 x4:x3 x4:x4
#> 0.0 0.0 0.0 0.0 0.0 0.0 0.0
attr(X3, c("formula"))
#> [1] "1.1*1 + 2.1*x1 + 3*x2 + 3.5*x3 + 0*x4 + 0*x1*x1 + 0*x2*x1 + 1.1*x3*x1 + 0*x4*x1 + 0*x2*x2 + 0*x3*x2 + 0*x4*x2 + 0*x3*x3 + 0*x4*x3 + 0*x4*x4"
head(X3)
#> Intercept x1 x2 x3 x4 error eta
#> 1 1 0.6928368 -1.0590339 3.0087527 0.7000470 -5.5049728 12.201522
#> 2 1 0.9499863 -0.5197346 0.8072111 0.8623604 9.4421276 5.204530
#> 3 1 1.7725536 0.2809647 3.1467152 0.9272930 0.4504675 22.814254
#> 4 1 0.3854582 -1.8219058 2.5092442 0.2722184 1.9986776 6.290029
#> 5 1 0.8729748 -1.0363895 4.3534449 1.6607136 13.8316154 19.241628
#> 6 1 1.5960719 -0.9723073 2.7087208 0.4267363 0.2805470 15.770996
#> y
#> 1 6.696549
#> 2 14.646657
#> 3 23.264721
#> 4 8.288707
#> 5 33.073244
#> 6 16.051543
lm3 <- lm(y ~ x1 + x2 + x3 + x1:x3, data = X3)
summary(lm3)
#>
#> Call:
#> lm(formula = y ~ x1 + x2 + x3 + x1:x3, data = X3)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17.1237 -3.4118 -0.0516 3.5351 14.0045
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.3462 0.7499 1.795 0.0729 .
#> x1 2.3052 0.4880 4.724 2.65e-06 ***
#> x2 3.4963 0.1817 19.241 < 2e-16 ***
#> x3 3.5691 0.2404 14.847 < 2e-16 ***
#> x1:x3 0.9988 0.1546 6.461 1.63e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 5.022 on 995 degrees of freedom
#> Multiple R-squared: 0.8057, Adjusted R-squared: 0.8049
#> F-statistic: 1031 on 4 and 995 DF, p-value: < 2.2e-16
#>
## Same with more interesting variable names in the means vector
X3 <- genCorrelatedData3(N = 1000,
means = c(friend = 1, enemy = -1, ally = 3, neutral = 1),
sds = 1, rho = 0.4,
beta = c(1.1, 2.1, 3, 3.5, 0, 0, 0, 1.1),
intercept = TRUE, stde = 5)
head(X3)
#> Intercept friend enemy ally neutral error eta
#> 1 1 -0.1681050 -0.7861065 3.756930 1.7048401 0.6514663 10.843199
#> 2 1 0.2516115 -0.8363906 4.065826 0.6284787 -5.9844878 14.474913
#> 3 1 0.8386541 -1.4332007 4.135587 0.3841888 2.7038132 16.851286
#> 4 1 -0.4300436 0.6559291 1.329120 0.2977512 6.1820822 6.187878
#> 5 1 1.6837800 -0.4756650 4.352780 2.0642991 -1.2478193 26.505712
#> 6 1 1.7576422 -0.8458070 2.948483 1.6212896 12.4401769 18.273933
#> y
#> 1 11.494665
#> 2 8.490425
#> 3 19.555100
#> 4 12.369960
#> 5 25.257893
#> 6 30.714110
attr(X3, c("beta"))
#> Intercept friend enemy ally neutral
#> 1.1 2.1 3.0 3.5 0.0
#> friend:friend enemy:friend ally:friend neutral:friend enemy:enemy
#> 0.0 0.0 1.1 0.0 0.0
#> ally:enemy neutral:enemy ally:ally neutral:ally neutral:neutral
#> 0.0 0.0 0.0 0.0 0.0
X3 <- genCorrelatedData3(N = 1000, means = c(x1 = 50, x2 = 50, x3 = 50),
sds = 10, rho = 0.4,
beta = c("Intercept" = .1, x1 = .01, x2 = .2, x3 = .5,
"x1:x3" = .1))
lm3 <- lm(y ~ x1 + x2 + x3 + x1:x3, data = X3)
## Names via col.names argument: must match formula
X2 <- genCorrelatedData3("y ~ 1.1 + 2.1 * educ + 3 * hlth + 3 * ses + 1.1 * educ:ses",
N = 100, means = c(50, 50, 50, 20),
sds = 10, rho = 0.4, col.names = c("educ", "hlth", "ses", "wght"))
str(X2)
#> 'data.frame': 100 obs. of 7 variables:
#> $ educ : num 46.2 54 56.3 50.5 47.8 ...
#> $ hlth : num 59 47.3 54 46.9 48.3 ...
#> $ ses : num 56.9 52.7 60.8 47.8 34.6 ...
#> $ wght : num 27.35 20.21 24.6 8.27 23.44 ...
#> $ error: num 0.0254 -1.7375 1.8675 -0.8238 -1.3211 ...
#> $ eta : num 3338 3542 4229 3046 2170 ...
#> $ y : num 3338 3540 4230 3045 2169 ...
#> - attr(*, "beta")= Named num [1:5] 1.1 2.1 3 3 1.1
#> ..- attr(*, "names")= chr [1:5] "" "educ" "hlth" "ses" ...
#> - attr(*, "formula")= chr "1.1*1 + 2.1*educ + 3*hlth + 3*ses + 1.1*educ*ses"
#> - attr(*, "stde")= num 1
X3 <- genCorrelatedData3("y ~ 1.1 + 2.1 * educ + 3 * hlth + 3 * ses + 1.1 * educ:ses",
N = 100, means = c(50, 50, 50, 20),
sds = 10, rho = 0.4, col.names = c("educ", "hlth", "ses", "wght"),
intercept = TRUE)
str(X3)
#> 'data.frame': 100 obs. of 8 variables:
#> $ Intercept: num 1 1 1 1 1 1 1 1 1 1 ...
#> $ educ : num 51.3 43.5 50.2 63.1 48.2 ...
#> $ hlth : num 39.2 66.2 51.3 59.9 36 ...
#> $ ses : num 32.5 45.5 56.7 61 40.9 ...
#> $ wght : num -0.918 3.353 19.995 18.3 6.12 ...
#> $ error : num -0.355 -1.1511 -1.4833 0.0189 0.6038 ...
#> $ eta : num 2158 2608 3567 4731 2499 ...
#> $ y : num 2158 2606 3566 4731 2499 ...
#> - attr(*, "beta")= Named num [1:5] 1.1 2.1 3 3 1.1
#> ..- attr(*, "names")= chr [1:5] "" "educ" "hlth" "ses" ...
#> - attr(*, "formula")= chr "1.1*1 + 2.1*educ + 3*hlth + 3*ses + 1.1*educ*ses"
#> - attr(*, "stde")= num 1
## note the logistic errors have residual std.error approximately 5 * pi/sqrt(3)
X1b <-
genCorrelatedData3("y ~ 1.1 + 2.1 * x1 + 3 * x2 + 3.5 * x3 + 1.1 * x1:x3",
N = 1000, means = c(x1 = 1, x2 = -1, x3 = 3),
sds = 1, rho = 0.4, stde = 5, distrib = rlogis)
lm1b <- lm(y ~ x1 + x2 + x3 + x1:x3, data = X1b)
summary(lm1b)
#>
#> Call:
#> lm(formula = y ~ x1 + x2 + x3 + x1:x3, data = X1b)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -37.232 -5.603 0.347 5.415 39.077
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.002895 1.356108 -0.002 0.99830
#> x1 2.686829 0.908773 2.957 0.00318 **
#> x2 2.253258 0.347466 6.485 1.40e-10 ***
#> x3 3.364816 0.422523 7.964 4.54e-15 ***
#> x1:x3 1.132114 0.282692 4.005 6.67e-05 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 9.555 on 995 degrees of freedom
#> Multiple R-squared: 0.5353, Adjusted R-squared: 0.5334
#> F-statistic: 286.5 on 4 and 995 DF, p-value: < 2.2e-16
#>
## t distribution is very sensitive for fractional df between 1 and 2 (recall
## stde parameter is passed through to df in rt.
X1c <-
genCorrelatedData3("y ~ 1.1 + 2.1 * x1 + 3 * x2 + 3.5 * x3 + 1.1 * x1:x3",
N = 1000, means = c(x1 = 1, x2 = -1, x3 = 3),
sds = 1, rho = 0.4, stde = 1.2, distrib = rt)
lm1c <- lm(y ~ x1 + x2 + x3 + x1:x3, data = X1c)
summary(lm1c)
#>
#> Call:
#> lm(formula = y ~ x1 + x2 + x3 + x1:x3, data = X1c)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -125.307 -1.029 0.027 0.959 73.215
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.2272 0.9350 0.243 0.808
#> x1 2.7200 0.5880 4.626 4.22e-06 ***
#> x2 3.1911 0.2333 13.680 < 2e-16 ***
#> x3 3.8702 0.2936 13.182 < 2e-16 ***
#> x1:x3 0.9144 0.1793 5.099 4.08e-07 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 6.473 on 995 degrees of freedom
#> Multiple R-squared: 0.7263, Adjusted R-squared: 0.7251
#> F-statistic: 659.9 on 4 and 995 DF, p-value: < 2.2e-16
#>