Stack together data frames — rbindFill • rockchalk

At the end of the code for plyr::rbind.fill, the author explains that it uses an experimental function to build the data.frame. I would rather not put any weight on an experimental function, so I set out to create a new rbindFill. This function uses no experimental functions. It does not rely on any functions from packages outside base R, except, of course, for functions in this package.
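To make the idea of stacking with fill concrete, here is a minimal sketch that uses only base R. The helper name stackFillSketch is hypothetical and this is not the rockchalk implementation; it only illustrates padding each data frame with NA columns before calling rbind, and it performs no class checking.

```r
# Hypothetical helper (an assumption for illustration, not rockchalk code):
# stack data frames, filling columns absent from a data frame with NA.
stackFillSketch <- function(...) {
    dfs <- list(...)
    # Union of all column names, in order of first appearance
    allNames <- unique(unlist(lapply(dfs, names)))
    padded <- lapply(dfs, function(d) {
        # Add an all-NA column for every name this data frame lacks
        for (nm in setdiff(allNames, names(d))) {
            d[[nm]] <- rep(NA, nrow(d))
        }
        # Put the columns in a common order so rbind lines them up
        d[, allNames, drop = FALSE]
    })
    do.call(rbind, padded)
}

a <- data.frame(x = 1:2, y = c("u", "v"))
b <- data.frame(x = 3:4, z = c(0.5, 1.5))
ab <- stackFillSketch(a, b)
str(ab)
```

In this sketch the rows from a have NA in z and the rows from b have NA in y.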

Usage

rbindFill(...)

Arguments

...: Data frames

Value

A stacked data frame

Details

Along the way, I noticed a feature that seems to be a flaw in both rbind and rbind.fill. The examples demonstrate that base R rbind and plyr::rbind.fill both have undesirable properties when the data sets being combined contain factors and ordered variables. This function introduces a "data consistency check" that prevents corruption of variables when data frames are combined. This "safe" version notices differences in the classes of variables among the data.frames and stops with an error message to alert the user to the problem.
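The kind of check described above might look like the following base R sketch. The helper name checkClassesSketch and its error message are assumptions made for illustration; they are not taken from the rockchalk source, which may perform the check differently.

```r
# Hypothetical helper (an assumption for illustration, not rockchalk code):
# stop if a column that appears in more than one data frame differs in class.
checkClassesSketch <- function(...) {
    dfs <- list(...)
    allNames <- unique(unlist(lapply(dfs, names)))
    for (nm in allNames) {
        # Collapse each class vector, e.g. c("ordered", "factor"), to one
        # string; data frames without the column contribute nothing
        cls <- unlist(lapply(dfs, function(d) {
            if (nm %in% names(d)) paste(class(d[[nm]]), collapse = "/") else NULL
        }))
        if (length(unique(cls)) > 1) {
            stop("Column '", nm, "' has inconsistent classes: ",
                 paste(unique(cls), collapse = " vs. "))
        }
    }
    invisible(TRUE)
}

d1 <- data.frame(g = factor(c("a", "b")))
d2 <- data.frame(g = factor(c("a", "b"), ordered = TRUE))
# The factor versus ordered mismatch is caught and reported as an error
res <- tryCatch(checkClassesSketch(d1, d2),
                error = function(e) conditionMessage(e))
print(res)
```

Stopping, rather than silently coercing, leaves the decision about which class is correct to the user.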

Author

Paul Johnson

Examples

library(rockchalk)
set.seed(123123)
N <- 10000
dat <- genCorrelatedData2(N, means = c(10, 20, 5, 5, 6, 7, 9), sds = 3,
           stde = 3, rho = .2,  beta = c(1, 1, -1, 0.5))
#> [1] "The equation that was calculated was"
#> y = 1 + 1*x1 + -1*x2 + 0.5*x3 + NA*x4 + NA*x5 + NA*x6 + NA*x7 
#>  + 0*x1*x1 + 0*x2*x1 + 0*x3*x1 + 0*x4*x1 + 0*x5*x1 + 0*x6*x1 + 0*x7*x1 
#>  + 0*x1*x2 + 0*x2*x2 + 0*x3*x2 + 0*x4*x2 + 0*x5*x2 + 0*x6*x2 + 0*x7*x2 
#>  + 0*x1*x3 + 0*x2*x3 + 0*x3*x3 + 0*x4*x3 + 0*x5*x3 + 0*x6*x3 + 0*x7*x3 
#>  + 0*x1*x4 + 0*x2*x4 + 0*x3*x4 + 0*x4*x4 + 0*x5*x4 + 0*x6*x4 + 0*x7*x4 
#>  + 0*x1*x5 + 0*x2*x5 + 0*x3*x5 + 0*x4*x5 + 0*x5*x5 + 0*x6*x5 + 0*x7*x5 
#>  + 0*x1*x6 + 0*x2*x6 + 0*x3*x6 + 0*x4*x6 + 0*x5*x6 + 0*x6*x6 + 0*x7*x6 
#>  + 0*x1*x7 + 0*x2*x7 + 0*x3*x7 + 0*x4*x7 + 0*x5*x7 + 0*x6*x7 + 0*x7*x7 
#>  + N(0,3) random error 
dat1 <- dat
dat1$xcat1 <- factor(sample(c("a", "b", "c", "d"), N, replace=TRUE))
dat1$xcat2 <- factor(sample(c("M", "F"), N, replace=TRUE),
                    levels = c("M", "F"), labels = c("Male", "Female"))
dat1$y <- dat$y +
          as.vector(contrasts(dat1$xcat1)[dat1$xcat1, ] %*% c(0.1, 0.2, 0.3))
dat1$xchar1 <- rep(letters[1:26], length.out = N)
dat2 <- dat
dat1$x3 <- NULL
dat2$x2 <- NULL
dat2$xcat2 <- factor(sample(c("M", "F"), N, replace=TRUE),
                     levels = c("M", "F"), labels = c("Male", "Female"))
dat2$xcat3 <- factor(sample(c("K1", "K2", "K3", "K4"), N, replace=TRUE))
dat2$xchar1 <- "1"
dat3 <- dat
dat3$x1 <- NULL
dat3$xcat3 <-  factor(sample(c("L1", "L2", "L3", "L4"), N, replace=TRUE)) 
dat.stack <- rbindFill(dat1, dat2, dat3)
str(dat.stack)
#> 'data.frame':	30000 obs. of  12 variables:
#>  $ y     : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ x1    : num  8.46 5.06 9.4 9.27 8.69 ...
#>  $ x2    : num  17.9 18.6 19 20.7 21 ...
#>  $ x4    : num  5.62 3.07 6.66 1.81 3.81 ...
#>  $ x5    : num  5.95 4.15 6.44 3.35 2.5 ...
#>  $ x6    : num  4.607 0.274 10.619 3.613 8.686 ...
#>  $ x7    : num  7.88 6.73 7.06 9.37 10.37 ...
#>  $ xcat1 : Factor w/ 4 levels "a","b","c","d": 1 4 4 2 4 1 3 1 1 4 ...
#>  $ xcat2 : Factor w/ 2 levels "Male","Female": 2 2 1 1 1 2 2 1 1 1 ...
#>  $ xchar1: chr  "a" "b" "c" "d" ...
#>  $ x3    : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ xcat3 : Factor w/ 8 levels "K1","K2","K3",..: NA NA NA NA NA NA NA NA NA NA ...

## Possible BUG alert about base::rbind and plyr::rbind.fill
## Demonstrate the problem of a same-named variable that is a factor in one
## data frame and an ordered variable in the other
dat5 <- data.frame(ds = "5", x1 = rnorm(N),
                   xcat1 = gl(20, 5, labels = LETTERS[20:1]))
dat6 <- data.frame(ds = "6", x1 = rnorm(N),
                   xcat1 = gl(20, 5, labels = LETTERS[1:20], ordered = TRUE))
## rbind reduces xcat1 to factor, whether we bind dat5 or dat6 first.
stack1 <- base::rbind(dat5, dat6)
str(stack1)
#> 'data.frame':	20000 obs. of  3 variables:
#>  $ ds   : chr  "5" "5" "5" "5" ...
#>  $ x1   : num  -0.35815 -1.02767 -0.63943 -0.00357 -0.13704 ...
#>  $ xcat1: Factor w/ 20 levels "T","S","R","Q",..: 1 1 1 1 1 2 2 2 2 2 ...
## note xcat1 levels are ordered T, S, R, Q
stack2 <- base::rbind(dat6, dat5)
str(stack2)
#> 'data.frame':	20000 obs. of  3 variables:
#>  $ ds   : chr  "6" "6" "6" "6" ...
#>  $ x1   : num  0.596 0.793 -0.389 0.408 -2.037 ...
#>  $ xcat1: Factor w/ 20 levels "A","B","C","D",..: 1 1 1 1 1 2 2 2 2 2 ...
## xcat1 levels are A, B, C, D
## stack3 <- plyr::rbind.fill(dat5, dat6)
## str(stack3)
## xcat1 is a factor with levels T, S, R, Q ...
## stack4 <- plyr::rbind.fill(dat6, dat5)
## str(stack4)
## oops, xcat1 is ordinal with levels A < B < C < D
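## As described in Details, rbindFill should stop with an error here
## because the class of xcat1 differs between dat5 and dat6
## stack5 <- rbindFill(dat5, dat6)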