appendLevels combines levels without sorting such that levels of the first argument will not require re-coding.
recodeLevels is a generic for recoding a factor to a desired set of levels; it also has a method for large ff objects.
sortLevels is a generic for level sorting and recoding of single factors or of all factors of a ffdf dataframe.
appendLevels(...)
recodeLevels(x, lev)
# S3 method for class 'factor'
recodeLevels(x, lev)
# S3 method for class 'ff'
recodeLevels(x, lev)
sortLevels(x)
# S3 method for class 'factor'
sortLevels(x)
# S3 method for class 'ff'
sortLevels(x)
# S3 method for class 'ffdf'
sortLevels(x)
When reading a long file with categorical columns, the final set of factor levels is only known once the complete file has been read.
When a file is so large that we read it in chunks, the new levels need to be added incrementally.
rbind.data.frame sorts combined levels, which requires recoding. For ff factors this would mean recoding all previously read chunks, potentially on disk, each time a new chunk arrives, which is too expensive.
Therefore read.table.ffdf simply appends new levels with appendLevels, without sorting, and the recodeLevels and sortLevels generics provide a convenient means for sorting and recoding levels after all chunks have been read.
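The append-then-sort strategy described above can be sketched in base R without ff. This is only an illustration under stated assumptions: union() and match() stand in for appendLevels and the recoding step, and the names chunk1, chunk2, lev1, lev2, and recode are hypothetical; it is not how ff implements these functions internally.

```r
# Two chunks of a categorical column, as they might arrive from a chunked read
# (hypothetical data).
chunk1 <- c("d", "e", "f")
chunk2 <- c("a", "f", "g")

# Append new levels without sorting: existing levels keep their positions.
# union() is a stand-in for appendLevels here (assumption).
lev1 <- unique(chunk1)
lev2 <- union(lev1, chunk2)

# Integer codes of each chunk relative to the levels known at that time.
codes1 <- match(chunk1, lev1)
codes2 <- match(chunk2, lev2)

# Codes of the first chunk remain valid under the extended levels,
# so no previously written chunk needs recoding.
stopifnot(identical(lev2[codes1], chunk1))

# Sorting the levels afterwards requires a single recoding pass over all codes.
sorted_lev <- sort(lev2)
recode <- match(lev2, sorted_lev)      # maps old code -> new code
all_codes <- recode[c(codes1, codes2)]

# After recoding, codes and sorted levels reproduce the original values.
stopifnot(identical(sorted_lev[all_codes], c(chunk1, chunk2)))
```

The point of the sketch is that appending costs nothing for already-stored codes, whereas sorting touches every code once, which is why it is deferred until all chunks have been read.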
appendLevels returns a vector of combined levels, recodeLevels and sortLevels return the input object with changed levels. Do read the note!
You need to re-assign the return value not only for RAM objects but also for ff objects. Remember ff's hybrid copying semantics: LimWarn.
If you forget to re-assign the returned object, you will end up with ff objects that have their integer codes re-coded to the new levels but still carry the old levels as a virtual attribute.
message("Let's create a factor with few levels")
#> Let's create a factor with few levels
x <- ff(letters[4:6], levels=letters[4:6])
message("Let's interpret the same ff file without levels in order to see the codes")
#> Let's interpret the same ff file without levels in order to see the codes
y <- x
levels(y) <- NULL
levels(x)
#> [1] "d" "e" "f"
data.frame(factor=x[], codes=y[], stringsAsFactors = TRUE)
#> factor codes
#> 1 d 1
#> 2 e 2
#> 3 f 3
levels(x) <- appendLevels(levels(x), letters)
levels(x)
#> [1] "d" "e" "f" "a" "b" "c" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
data.frame(factor=x[], codes=y[], stringsAsFactors = TRUE)
#> factor codes
#> 1 d 1
#> 2 e 2
#> 3 f 3
x <- sortLevels(x) # implicit recoding is chunked where necessary
levels(x)
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
data.frame(factor=x[], codes=y[], stringsAsFactors = TRUE)
#> factor codes
#> 1 d 4
#> 2 e 5
#> 3 f 6
message("NEVER forget to reassign the result of recodeLevels or sortLevels,
look at the following mess")
#> NEVER forget to reassign the result of recodeLevels or sortLevels,
#> look at the following mess
recodeLevels(x, rev(levels(x)))
#> ff (open) integer length=3 (3) levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
#> [1] [2] [3]
#> d e : f
message("NOW the codings have changed, but not the levels, the result is wrong data")
#> NOW the codings have changed, but not the levels, the result is wrong data
levels(x)
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
data.frame(factor=x[], codes=y[], stringsAsFactors = TRUE)
#> factor codes
#> 1 w 23
#> 2 v 22
#> 3 u 21
rm(x);gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 1181600 63.2 1994352 106.6 1994352 106.6
#> Vcells 2205442 16.9 8790397 67.1 8790397 67.1
if (FALSE) { # \dontrun{
n <- 5e7
message("reading a factor from a file is as fast ...")
system.time(
fx <- ff(factor(letters[1:25]), length=n)
)
system.time(x <- fx[])
str(x)
rm(x); gc()
message("... as creating it in-RAM (R-2.11.1) which is theoretically impossible ...")
system.time({
x <- integer(n)
x[] <- 1:25
levels(x) <- letters[1:25]
class(x) <- "factor"
})
str(x)
rm(x); gc()
message("... but is possible if we avoid some unnecessary copying that is triggered
by assignment functions")
system.time({
x <- integer(n)
x[] <- 1:25
setattr(x, "levels", letters[1:25])
setattr(x, "class", "factor")
})
str(x)
rm(x); gc()
rm(n)
} # }