Create a function to simulate data

This function is used to create a new function that will simulate data. This could be used by a teacher to create homework or test conditions that the students would then simulate data from (each student could have their own unique data set) or this function could be used in simulations for power or other values of interest.

simfun(expr, drop, ...)

Arguments

expr: This is an expression, usually just one or more statements, that will generate the simulated data.
drop: A character vector of names of objects/columns that will be dropped from the return value. These are usually intermediate objects or parameter values that you don't want carried into the final returned object.
...: Additional named items that will be in the environment when expr is evaluated.

Details

This function creates another function to simulate data. You supply the general ideas of the simulation to this function and the resulting function can then be used to create simulated datasets. The resulting function can then be given to students for them to simulate datasets, or used localy as part of larger simulations.

The environment where the expression is evaluated will have all the columns or elements of the data argument available as well as the data argument itself. Any variables/parameters passed through ... in the original function will also be available. You then supply the code based on those variables to create the simulated data. The names of any columns or parameters submitted as part of data will need to match the code exactly (provide specific directions to the users on what columns need to be named). Rember that indexing using factors indexes based on the underlying integers not the character representation. See the examples for details.

The resulting function can be saved and loaded/attached in different R sessions (it is important to use save rather than something like dput so that the environment of the function is preserved).

The function includes an optional seed that will be used with the char2seed function (if the seed is a character) so that each student could use a unique but identifiable seed (such as their name or something based on their name) so that each student will use a different dataset, but the instructor will be able to generate the exact same dataset to check answers.

The "True" parameters are hidden in the environment of the function so the student will not see the "true" values by simply printing the function. However an intermediate level R programmer/user would be able to extract the simulation parameters (but the correct homework or test answer will not be the simulation parameters).

Value

The return value is a function that will generate simulated datasets. The function will have 2 arguments, data and seed. The data argument can be either a data frame of the predictor variables (study design) or a list of simulation parameters. The seed argument will be passed on to set.seed if it is numeric and char2seed if it is a character.

The return value of this function is a dataframe with the simulated data and any explanitory variables passed to the function.

See the examples for how to use the result function.

Author

Greg Snow, 538280@gmail.com

Note

This function was not designed for speed, if you are doing long simulations then hand crafting the simulation function will probably run quicker than one created using this function.

Like the prediction functions the data frame passed in as the data argument will need to have exact names of the columns to match with the code (including capitolization).

This function is different from the simulate functions in that it allows for different sample sizes, user specified parameters, and different predictor variables.

Examples

# Create a function to simulate heights for a given dataset

simheight <- simfun( {h <- c(64,69); height<-h[sex]+ rnorm(10,0,3)}, drop='h' )

my.df <- data.frame(sex=factor(rep(c('Male','Female'),each=5)))
simdat <- simheight(my.df)
t.test(height~sex, data=simdat)
#> 
#> 	Welch Two Sample t-test
#> 
#> data:  height by sex
#> t = -1.9324, df = 7.2485, p-value = 0.09316
#> alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
#> 95 percent confidence interval:
#>  -10.600661   1.029828
#> sample estimates:
#> mean in group Female   mean in group Male 
#>             63.47723             68.26265 
#> 

# a more general version, and have the expression predefined
# (note that this assumes that the levels are Female, Male in that order)

myexpr <- quote({
  n <- length(sex)
  h <- c(64,69)
  height <- h[sex] + rnorm(n,0,3)
})

simheight <- simfun(eval(myexpr), drop=c('n','h'))
my.df <- data.frame(sex=factor(sample(rep(c('Male','Female'),c(5,10)))))
(simdat <- simheight(my.df))
#> Error in eval(expr): object 'myexpr' not found

# similar to above, but use named parameter vector and index by names

myexpr <- quote({
  n <- length(sex)
  height <- h[ as.character(sex)] + rnorm(n,0,sig)
})

simheight <- simfun(eval(myexpr), drop=c('n','h','sig'), 
  h=c(Male=69,Female=64), sig=3)
my.df <- data.frame(sex=factor(sample(c('Male','Female'),100, replace=TRUE)))
(simdat <- simheight(my.df, seed='example'))
#> Error in eval(expr): object 'myexpr' not found

# Create a function to simulate Sex and Height for a given sample size 
# (actually it will generate n males and n females for a total of 2*n samples)
# then use it in a set of simulations
simheight <- simfun( {sex <- factor(rep(c('Male','Female'),each=n))
                      height <- h[sex] + rnorm(2*n,0,s)
                      }, drop=c('h','n'), h=c(64,69), s=3)
(simdat <- simheight(list(n=10)))
#>    s   height    sex
#> 1  3 73.81695   Male
#> 2  3 70.70982   Male
#> 3  3 64.66008   Male
#> 4  3 68.74707   Male
#> 5  3 65.45090   Male
#> 6  3 70.29465   Male
#> 7  3 69.75234   Male
#> 8  3 63.21590   Male
#> 9  3 69.37512   Male
#> 10 3 70.20940   Male
#> 11 3 58.81850 Female
#> 12 3 60.28757 Female
#> 13 3 62.91988 Female
#> 14 3 57.09279 Female
#> 15 3 65.59018 Female
#> 16 3 64.55662 Female
#> 17 3 67.34557 Female
#> 18 3 65.62901 Female
#> 19 3 64.37090 Female
#> 20 3 66.88287 Female

out5  <- replicate(1000, t.test(height~sex, data=simheight(list(n= 5)))$p.value)
out15 <- replicate(1000, t.test(height~sex, data=simheight(list(n=15)))$p.value)

mean(out5  <= 0.05)
#> [1] 0.632
mean(out15 <= 0.05)
#> [1] 0.993

# use a fixed population

simstate <- simfun({
  tmp <- state.df[as.character(State),]
  Population <- tmp[['Population']]
  Income <- tmp[['Income']]
  Illiteracy <- tmp[['Illiteracy']]
}, state.df=as.data.frame(state.x77), drop=c('tmp','state.df'))
simstate(data.frame(State=sample(state.name,10)))
#>          State Illiteracy Income Population
#> 1     Kentucky        1.6   3712       3387
#> 2      Georgia        2.0   4091       4931
#> 3     Arkansas        1.9   3378       2110
#> 4   New Mexico        2.2   3601       1144
#> 5        Texas        2.2   4188      12237
#> 6     Nebraska        0.6   4508       1544
#> 7     Virginia        1.4   4701       4981
#> 8        Idaho        0.6   4119        813
#> 9       Alaska        1.5   6315        365
#> 10 Connecticut        1.1   5348       3100

# Use simulation, but override setting the seed

simheight <- simfun({
  set.seed(1234)
  h <- c(64,69)
  sex <- factor(rep(c('Female','Male'),each=50))
  height <- round(rnorm(100, rep(h,each=50),3),1)
  sex <- sex[ID]
  height <- height[ID]
}, drop='h')
(newdat <- simheight(list(ID=c(1:5,51:55))))
#>    ID height    sex
#> 1   1   60.4 Female
#> 2   2   64.8 Female
#> 3   3   67.3 Female
#> 4   4   57.0 Female
#> 5   5   65.3 Female
#> 6  51   63.6   Male
#> 7  52   67.3   Male
#> 8  53   65.7   Male
#> 9  54   66.0   Male
#> 10 55   68.5   Male
(newdat2<- simheight(list(ID=1:10)))
#>    ID height    sex
#> 1   1   60.4 Female
#> 2   2   64.8 Female
#> 3   3   67.3 Female
#> 4   4   57.0 Female
#> 5   5   65.3 Female
#> 6   6   65.5 Female
#> 7   7   62.3 Female
#> 8   8   62.4 Female
#> 9   9   62.3 Female
#> 10 10   61.3 Female

# Using a fitted object

fit <- lm(Fertility ~ . , data=swiss)
simfert <- simfun({
  Fertility <- predict(fit, newdata=data)
  Fertility <- Fertility + rnorm(length(Fertility),0,summary(fit)$sigma)
}, drop=c('fit'), fit=fit)

tmpdat <- as.data.frame(lapply(swiss[,-1], 
            function(x) round(runif(100, min(x), max(x)))))
names(tmpdat) <- names(swiss)[-1]
fertdat <- simfert(tmpdat)
head(fertdat)
#>   Agriculture Examination Education Catholic Infant.Mortality Fertility
#> 1          60          15        37       68               15  27.13375
#> 2          48          36        23       81               14  56.87613
#> 3          29          21        40       86               13  39.00761
#> 4          69          18        41        7               20  35.33664
#> 5          48          35        31       68               18  37.71881
#> 6          66          18        11       85               17  72.41400
rbind(coef(fit), coef(lm(Fertility~., data=fertdat)))
#>      (Intercept) Agriculture Examination  Education  Catholic Infant.Mortality
#> [1,]    66.91518  -0.1721140  -0.2580082 -0.8709401 0.1041153         1.077048
#> [2,]    52.93949  -0.1145198  -0.1642061 -0.8820233 0.1159883         1.529692

# simulate a nested mixed effects model
simheight <- simfun({
  n.city <- length(unique(city))
  n.state <- length(unique(state))
  n <- length(city)
  height <- h[sex] + rnorm(n.state,0,sig.state)[state] + 
    rnorm(n.city,0,sig.city)[city] + rnorm(n,0,sig.e)
}, sig.state=1, sig.city=0.5, sig.e=3, h=c(64,69),
  drop=c('sig.state','sig.city','sig.e','h','n.city','n.state','n'))

tmpdat <- data.frame(state=gl(5,20), city=gl(10,10), 
  sex=gl(2,5,length=100, labels=c('F','M')))
heightdat <- simheight(tmpdat)

# similar to above, but include cost information, this assumes that 
# each new state costs $100, each new city is $10, and each subject is $1
# this shows 2 possible methods

simheight <- simfun({
  n.city <- length(unique(city))
  n.state <- length(unique(state))
  n <- length(city)
  height <- h[sex] + rnorm(n.state,0,sig.state)[state] + 
    rnorm(n.city,0,sig.city)[city] + rnorm(n,0,sig.e)
  cost <- 100 * (!duplicated(state)) + 10*(!duplicated(city)) + 1
  cat('The total cost for this design is $', 100*n.state+10*n.city+1*n, 
  '\n', sep='')
}, sig.state=1, sig.city=0.5, sig.e=3, h=c(64,69),
  drop=c('sig.state','sig.city','sig.e','h','n.city','n.state','n'))

tmpdat <- data.frame(state=gl(5,20), city=gl(10,10), 
  sex=gl(2,5,length=100, labels=c('F','M')))
heightdat <- simheight(tmpdat)
#> The total cost for this design is $700
sum(heightdat$cost)
#> [1] 700


# another mixed model method

simheight <- simfun({
  state <- gl(n.state, n/n.state)
  city <- gl(n.city*n.state, n/n.city/n.state)
  sex <- gl(2, n.city, length=n, labels=c('F','M') )
  height <- h[sex] + rnorm(n.state,0,sig.state)[state] + 
    rnorm(n.city*n.state,0,sig.city)[city] + rnorm(n,0,sig.e)
}, drop=c('n.state','n.city','n','sig.city','sig.state','sig.e','h'))

heightdat <- simheight( list(
  n.state=5, n.city=2, n=100, sig.state=10, sig.city=3, sig.e=1, h=c(64,69)
))