Function read.table.ffdf reads separated flat files into ffdf objects, very much like (and using) read.table.
It also works with convenience wrappers such as read.csv and provides its own counterparts (e.g. read.csv.ffdf) for R's usual wrappers.
read.table.ffdf(
x = NULL
, file, fileEncoding = ""
, nrows = -1, first.rows = NULL, next.rows = NULL
, levels = NULL, appendLevels = TRUE
, FUN = "read.table", ...
, transFUN = NULL
, asffdf_args = list()
, BATCHBYTES = getOption("ffbatchbytes")
, VERBOSE = FALSE
)
read.csv.ffdf(...)
read.csv2.ffdf(...)
read.delim.ffdf(...)
read.delim2.ffdf(...)
x
NULL or an optional ffdf object to which the read records are appended.
If this is provided, it defines crucial features that are otherwise determined during the 'first' chunk of reading:
vmodes, colnames, colClasses, sequence of predefined levels.
file
the name of the file which the data are to be read from.
Each row of the table appears as one line of the file. If it does
not contain an absolute path, the file name is
relative to the current working directory,
getwd(). Tilde-expansion is performed where supported.
Alternatively, file can be a readable text-mode
connection (which will be opened for reading if
necessary, and if so closed (and hence destroyed) at
the end of the function call).
fileEncoding
character string: if non-empty declares the
encoding used on a file (not a connection) so the character data can
be re-encoded. See file.
nrows
integer: the maximum number of rows to read in (includes first.rows in case a 'first' chunk is read). Negative and other invalid values are ignored.
first.rows
integer: number of rows to be read in the first chunk, see details. Default is the value given at next.rows or 1e3 otherwise.
Ignored if x is given.
next.rows
integer: number of rows to be read in further chunks, see details.
By default calculated as BATCHBYTES %/% sum(.rambytes[vmode(x)])
levels
NULL or an optional list in which each element, named with the col.names of a factor column, specifies the levels of that column.
Ignored if x is given.
appendLevels
logical.
A vector of permissions to expand levels for factor columns.
Recycled as necessary, or if the logical vector is named, unspecified values are taken to be TRUE.
Ignored during processing of the 'first' chunk.
FUN
character: name of a function that is called for reading each chunk, see read.table, read.csv, etc.
...
further arguments, passed to FUN in read.table.ffdf, or passed to read.table.ffdf in the convenience wrappers
transFUN
NULL or a function that is called on each data.frame chunk after reading with FUN and before further processing (for filtering, transformations etc.)
asffdf_args
further arguments passed to as.ffdf when converting the data.frame of the first chunk to ffdf.
Ignored if x is given.
BATCHBYTES
integer: bytes allowed for the size of the data.frame storing the result of reading one chunk. Default getOption("ffbatchbytes").
VERBOSE
logical: TRUE to report timings for each processed chunk (default FALSE)
read.table.ffdf has been designed to read very large (many rows) separated flat files in row chunks
and store the result in an ffdf object on disk that remains quickly accessible via ff techniques.
The first chunk is read with a default of 1000 rows; for subsequent chunks the number of rows is calculated so that a chunk does not require more RAM than getOption("ffbatchbytes").
The following could be reasons to change the parameter first.rows:
set first.rows=-1 to read the complete file in one go (requires enough RAM)
set first.rows to a smaller number if the pre-allocation of RAM for the first chunk with parameter nrows in read.table is too large, e.g. with many columns on a machine with little RAM.
set first.rows to a larger number if you expect better factor level ordering (factor levels are sorted in the first chunk but not in subsequent chunks; however, factor level ordering can be fixed later, see below).
By default the ffdf object is created on the fly at the end of reading the 'first' chunk, see argument first.rows.
The creation of the ffdf object is done via as.ffdf and can be fine-tuned by passing argument asffdf_args.
Even more control is possible by passing in an ffdf object as argument x, to which the read records are appended.
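The chunk-size rule given for next.rows (BATCHBYTES %/% sum(.rambytes[vmode(x)])) can be illustrated with a small calculation. This is only a sketch: the vector vm and its column names are hypothetical stand-ins for the vmodes of a real ffdf, and the result depends on the value of the ffbatchbytes option on your machine.

```r
library(ff)  # provides .rambytes and sets the option "ffbatchbytes" when loaded

# Hypothetical vmodes of a seven-column table (illustrative names only)
vm <- c(log = "logical", int = "integer", dbl = "double",
        fac = "integer", ord = "integer", dct = "double", dat = "double")

# RAM bytes needed per row while a chunk is held as a data.frame
bytes_per_row <- sum(.rambytes[vm])

# Number of rows per subsequent chunk under the current RAM budget
batchbytes <- getOption("ffbatchbytes")
next_rows <- batchbytes %/% bytes_per_row
next_rows
```

Wider tables (a larger bytes_per_row) therefore yield fewer rows per chunk; passing next.rows or BATCHBYTES explicitly overrides this default.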
read.table.ffdf has been designed to behave as much like read.table as possible. However, note the following differences:
Arguments 'colClasses' and 'col.names' are now also enforced during 'next.rows' chunks.
For example, giving colClasses=NA prevents colClasses from being derived from the first.rows chunk or from the ffdf object in parameter x.
colClass 'ordered' is allowed and will create an ordered factor.
Character vectors are not supported; character data must be read as one of the following colClasses: 'Date', 'POSIXct', 'factor', 'ordered'. By default character columns are read as factors. Accordingly, arguments 'as.is' and 'stringsAsFactors' are not allowed.
The sequence of levels.ff from chunked reading can depend on chunk size: by default, new levels found in a chunk are appended to the levels found in previous chunks, and no attempt is made to sort and recode the levels during chunked processing. Levels can be sorted and recoded most efficiently after all records have been read, using sortLevels.
The default for argument 'comment.char' is "" even for those FUN that have a different default. However, an explicit specification of 'comment.char' takes priority.
Note that using the 'skip' argument still requires reading the file from the beginning in order to count the lines to be skipped.
If you first read part of the file in order to understand its structure and then want to continue,
a more efficient solution than using 'skip' is to open a file connection and pass it to argument 'file'.
read.table.ffdf does the same in order to skip efficiently over previously read chunks.
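The connection technique can be sketched with base R alone. The file and column names below are made up for illustration, and the example uses plain read.csv rather than read.table.ffdf; it only shows that an open connection keeps its read position between calls.

```r
# Write a small CSV file to read back in two steps
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, val = letters[1:10]), csvfile, row.names = FALSE)

# Open the connection once; its read position persists between calls
con <- file(csvfile, open = "r")
head_part <- read.csv(con, header = TRUE, nrows = 4)

# The second call continues where the first stopped, so nothing is re-scanned.
# The header line is already consumed, so names and classes are passed explicitly.
rest_part <- read.csv(
  con, header = FALSE,
  col.names  = names(head_part),
  colClasses = unname(vapply(head_part, function(col) class(col)[1], character(1)))
)
close(con)
unlink(csvfile)

nrow(head_part)
nrow(rest_part)
```

With 'skip', the second call would have to re-read and count the first five lines of the file; with the open connection it starts directly at the sixth.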
An ffdf object. If created during the 'first' chunk pass, it will have one physical component per virtual column.
message("create some csv data on disk")
#> create some csv data on disk
x <- data.frame(
log=rep(c(FALSE, TRUE), length.out=26)
, int=1:26
, dbl=1:26 + 0.1
, fac=factor(letters)
, ord=ordered(LETTERS)
, dct=Sys.time()+1:26
, dat=seq(as.Date("1910/1/1"), length.out=26, by=1)
, stringsAsFactors = TRUE
)
x <- x[c(13:1, 13:1),]
csvfile <- tempPathFile(path=getOption("fftempdir"), extension="csv")
write.csv(x, file=csvfile, row.names=FALSE)
cat("Simply read csv with header\n")
#> Simply read csv with header
y <- read.csv(file=csvfile, header=TRUE)
y
#> log int dbl fac ord dct dat
#> 1 FALSE 13 13.1 m M 2025-03-17 22:36:08.015253 1910-01-13
#> 2 TRUE 12 12.1 l L 2025-03-17 22:36:07.015253 1910-01-12
#> 3 FALSE 11 11.1 k K 2025-03-17 22:36:06.015253 1910-01-11
#> 4 TRUE 10 10.1 j J 2025-03-17 22:36:05.015253 1910-01-10
#> 5 FALSE 9 9.1 i I 2025-03-17 22:36:04.015253 1910-01-09
#> 6 TRUE 8 8.1 h H 2025-03-17 22:36:03.015253 1910-01-08
#> 7 FALSE 7 7.1 g G 2025-03-17 22:36:02.015253 1910-01-07
#> 8 TRUE 6 6.1 f F 2025-03-17 22:36:01.015253 1910-01-06
#> 9 FALSE 5 5.1 e E 2025-03-17 22:36:00.015253 1910-01-05
#> 10 TRUE 4 4.1 d D 2025-03-17 22:35:59.015253 1910-01-04
#> 11 FALSE 3 3.1 c C 2025-03-17 22:35:58.015253 1910-01-03
#> 12 TRUE 2 2.1 b B 2025-03-17 22:35:57.015253 1910-01-02
#> 13 FALSE 1 1.1 a A 2025-03-17 22:35:56.015253 1910-01-01
#> 14 FALSE 13 13.1 m M 2025-03-17 22:36:08.015253 1910-01-13
#> 15 TRUE 12 12.1 l L 2025-03-17 22:36:07.015253 1910-01-12
#> 16 FALSE 11 11.1 k K 2025-03-17 22:36:06.015253 1910-01-11
#> 17 TRUE 10 10.1 j J 2025-03-17 22:36:05.015253 1910-01-10
#> 18 FALSE 9 9.1 i I 2025-03-17 22:36:04.015253 1910-01-09
#> 19 TRUE 8 8.1 h H 2025-03-17 22:36:03.015253 1910-01-08
#> 20 FALSE 7 7.1 g G 2025-03-17 22:36:02.015253 1910-01-07
#> 21 TRUE 6 6.1 f F 2025-03-17 22:36:01.015253 1910-01-06
#> 22 FALSE 5 5.1 e E 2025-03-17 22:36:00.015253 1910-01-05
#> 23 TRUE 4 4.1 d D 2025-03-17 22:35:59.015253 1910-01-04
#> 24 FALSE 3 3.1 c C 2025-03-17 22:35:58.015253 1910-01-03
#> 25 TRUE 2 2.1 b B 2025-03-17 22:35:57.015253 1910-01-02
#> 26 FALSE 1 1.1 a A 2025-03-17 22:35:56.015253 1910-01-01
cat("Read csv with header\n")
#> Read csv with header
ffy <- read.csv.ffdf(file=csvfile, header=TRUE)
ffy
#> ffdf (all open) dim=c(26,7), dimorder=c(1,2) row.names=NULL
#> ffdf virtual mapping
#> PhysicalName VirtualVmode PhysicalVmode AsIs VirtualIsMatrix
#> log log logical logical FALSE FALSE
#> int int integer integer FALSE FALSE
#> dbl dbl double double FALSE FALSE
#> fac fac integer integer FALSE FALSE
#> ord ord integer integer FALSE FALSE
#> dct dct integer integer FALSE FALSE
#> dat dat integer integer FALSE FALSE
#> PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol
#> log FALSE 1 1 1
#> int FALSE 2 1 1
#> dbl FALSE 3 1 1
#> fac FALSE 4 1 1
#> ord FALSE 5 1 1
#> dct FALSE 6 1 1
#> dat FALSE 7 1 1
#> PhysicalIsOpen
#> log TRUE
#> int TRUE
#> dbl TRUE
#> fac TRUE
#> ord TRUE
#> dct TRUE
#> dat TRUE
#> ffdf data
#> log int
#> 1 FALSE 13
#> 2 TRUE 12
#> 3 FALSE 11
#> 4 TRUE 10
#> 5 FALSE 9
#> 6 TRUE 8
#> 7 FALSE 7
#> 8 TRUE 6
#> : : :
#> 19 TRUE 8
#> 20 FALSE 7
#> 21 TRUE 6
#> 22 FALSE 5
#> 23 TRUE 4
#> 24 FALSE 3
#> 25 TRUE 2
#> 26 FALSE 1
#> dbl fac
#> 1 13.1 m
#> 2 12.1 l
#> 3 11.1 k
#> 4 10.1 j
#> 5 9.1 i
#> 6 8.1 h
#> 7 7.1 g
#> 8 6.1 f
#> : : :
#> 19 8.1 h
#> 20 7.1 g
#> 21 6.1 f
#> 22 5.1 e
#> 23 4.1 d
#> 24 3.1 c
#> 25 2.1 b
#> 26 1.1 a
#> ord dct
#> 1 M 2025-03-17 22:36:08.015253
#> 2 L 2025-03-17 22:36:07.015253
#> 3 K 2025-03-17 22:36:06.015253
#> 4 J 2025-03-17 22:36:05.015253
#> 5 I 2025-03-17 22:36:04.015253
#> 6 H 2025-03-17 22:36:03.015253
#> 7 G 2025-03-17 22:36:02.015253
#> 8 F 2025-03-17 22:36:01.015253
#> : : :
#> 19 H 2025-03-17 22:36:03.015253
#> 20 G 2025-03-17 22:36:02.015253
#> 21 F 2025-03-17 22:36:01.015253
#> 22 E 2025-03-17 22:36:00.015253
#> 23 D 2025-03-17 22:35:59.015253
#> 24 C 2025-03-17 22:35:58.015253
#> 25 B 2025-03-17 22:35:57.015253
#> 26 A 2025-03-17 22:35:56.015253
#> dat
#> 1 1910-01-13
#> 2 1910-01-12
#> 3 1910-01-11
#> 4 1910-01-10
#> 5 1910-01-09
#> 6 1910-01-08
#> 7 1910-01-07
#> 8 1910-01-06
#> : :
#> 19 1910-01-08
#> 20 1910-01-07
#> 21 1910-01-06
#> 22 1910-01-05
#> 23 1910-01-04
#> 24 1910-01-03
#> 25 1910-01-02
#> 26 1910-01-01
sapply(ffy[,], class)
#> log int dbl fac ord dct dat
#> "logical" "integer" "numeric" "factor" "factor" "factor" "factor"
message("reading with colClasses (an ordered factor won't work in read.csv)")
#> reading with colClasses (an ordered factor won't work in read.csv)
try(read.csv(file=csvfile, header=TRUE, colClasses=c(ord="ordered")
, stringsAsFactors = TRUE))
#> Error in methods::as(data[[i]], colClasses[i]) :
#> no method or default for coercing “character” to “ordered”
# TODO this could be fixed with the following two commands (Gabor Grothendieck),
# but it is not known what bad side effects they could have
#setOldClass("ordered")
#setAs("character", "ordered", function(from) ordered(from))
y <- read.csv(file=csvfile, header=TRUE, colClasses=c(dct="POSIXct", dat="Date")
, stringsAsFactors = TRUE)
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
)
rbind(
ram_class = sapply(y, function(x)paste(class(x), collapse = ","))
, ff_class = sapply(ffy[,], function(x)paste(class(x), collapse = ","))
, ff_vmode = vmode(ffy)
)
#> log int dbl fac ord
#> ram_class "logical" "integer" "numeric" "factor" "factor"
#> ff_class "logical" "integer" "numeric" "factor" "ordered,factor"
#> ff_vmode "logical" "integer" "double" "integer" "integer"
#> dct dat
#> ram_class "POSIXct,POSIXt" "Date"
#> ff_class "POSIXct,POSIXt" "Date"
#> ff_vmode "double" "double"
message("NOTE that reading in chunks can change the sequence of levels and thus the coding")
#> NOTE that reading in chunks can change the sequence of levels and thus the coding
message("(Sorting levels during chunked reading can be too expensive)")
#> (Sorting levels during chunked reading can be too expensive)
levels(ffy$fac[])
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, first.rows=6
, next.rows=10
, VERBOSE=TRUE
)
#> read.table.ffdf 1..6 (6) csv-read=0.001sec ffdf-write=0.006sec
#> read.table.ffdf 7..16 (10) csv-read=0sec ffdf-write=0.003sec
#> read.table.ffdf 17..26 (10) csv-read=0sec ffdf-write=0.003sec
#> read.table.ffdf 27..26 (0) csv-read=0sec
#> csv-read=0.001sec ffdf-write=0.012sec TOTAL=0.013sec
levels(ffy$fac[])
#> [1] "h" "i" "j" "k" "l" "m" "a" "b" "c" "d" "e" "f" "g"
message("If we don't know the levels we can sort them after reading")
#> If we don't know the levels we can sort them after reading
message("(Will rewrite all factor codes)")
#> (Will rewrite all factor codes)
message("NOTE that you MUST assign the return value of sortLevels()")
#> NOTE that you MUST assign the return value of sortLevels()
ffy <- sortLevels(ffy)
levels(ffy$fac[])
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
message("If we KNOW the levels we can fix levels upfront")
#> If we KNOW the levels we can fix levels upfront
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, first.rows=6
, next.rows=10
, levels=list(fac=letters, ord=LETTERS)
)
levels(ffy$fac[])
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
message("Or we inspect a sufficiently large chunk of data and use those")
#> Or we inspect a sufficiently large chunk of data and use those
table(ffy$fac[], exclude=NULL)
#>
#> a b c d e f g h i j k l m n o p q r s t u v w x y z
#> 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, nrows=13
, VERBOSE=TRUE
)
#> read.table.ffdf 1..13 (13) csv-read=0sec ffdf-write=0.006sec
#> csv-read=0sec ffdf-write=0.006sec TOTAL=0.006sec
message("append the rest to ffy")
#> append the rest to ffy
ffy <- read.csv.ffdf(
x=ffy
, file=csvfile
, header=FALSE
, skip=1 + nrow(ffy)
, VERBOSE=TRUE
)
#> read.table.ffdf 1..13 (13) csv-read=0.002sec ffdf-write=0.002sec
#> csv-read=0.002sec ffdf-write=0.002sec TOTAL=0.004sec
table(ffy$fac[], exclude=NULL)
#>
#> a b c d e f g h i j k l m
#> 2 2 2 2 2 2 2 2 2 2 2 2 2
message("We can turn unexpected factor levels to NA, say we only allowed a:l")
#> We can turn unexpected factor levels to NA, say we only allowed a:l
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, levels=list(fac=letters[1:12], ord=LETTERS[1:12])
, appendLevels=FALSE
)
sapply(colnames(ffy), function(i)sum(is.na(ffy[[i]][])))
#> log int dbl fac ord dct dat
#> 0 0 0 2 2 0 0
message("let's store some columns more efficiently")
#> let's store some columns more efficiently
sum(.ffbytes[vmode(ffy)])
#> [1] 36.25
ffy$log <- clone(ffy$log, vmode="boolean")
ffy$fac <- clone(ffy$fac, vmode="byte")
ffy$ord <- clone(ffy$ord, vmode="byte")
sum(.ffbytes[vmode(ffy)])
#> [1] 30.125
message("let's make a template with zero rows")
#> let's make a template with zero rows
ffx <- clone(ffy)
nrow(ffx) <- 0
message("reading with template and colClasses")
#> reading with template and colClasses
ffy <- read.csv.ffdf(
x=ffx
, file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, next.rows = 12
, VERBOSE = TRUE
)
#> read.table.ffdf 1..12 (12) csv-read=0sec ffdf-write=0.003sec
#> read.table.ffdf 13..24 (12) csv-read=0sec ffdf-write=0.003sec
#> read.table.ffdf 25..26 (2) csv-read=0sec ffdf-write=0.003sec
#> csv-read=0sec ffdf-write=0.009sec TOTAL=0.009sec
rbind(
ff_class = sapply(ffy[,], function(x)paste(class(x), collapse = ","))
, ff_vmode = vmode(ffy)
)
#> log int dbl fac ord
#> ff_class "logical" "integer" "numeric" "factor" "ordered,factor"
#> ff_vmode "boolean" "integer" "double" "byte" "byte"
#> dct dat
#> ff_class "POSIXct,POSIXt" "Date"
#> ff_vmode "double" "double"
levels(ffx$fac[])
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
levels(ffy$fac[])
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
message("reading with template without colClasses")
#> reading with template without colClasses
ffy <- read.csv.ffdf(
x=ffx
, file=csvfile
, header=TRUE
, next.rows = 12
, VERBOSE = TRUE
)
#> read.table.ffdf 1..12 (12) csv-read=0.001sec ffdf-write=0.003sec
#> read.table.ffdf 13..24 (12) csv-read=0sec ffdf-write=0.003sec
#> read.table.ffdf 25..26 (2) csv-read=0sec ffdf-write=0.003sec
#> csv-read=0.001sec ffdf-write=0.009sec TOTAL=0.01sec
rbind(
ff_class = sapply(ffy[,], function(x)paste(class(x), collapse = ","))
, ff_vmode = vmode(ffy)
)
#> log int dbl fac ord
#> ff_class "logical" "integer" "numeric" "factor" "ordered,factor"
#> ff_vmode "boolean" "integer" "double" "byte" "byte"
#> dct dat
#> ff_class "POSIXct,POSIXt" "Date"
#> ff_vmode "double" "double"
levels(ffx$fac[])
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
levels(ffy$fac[])
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
message("We can fine-tune the creation of the ffdf")
#> We can fine-tune the creation of the ffdf
message("- let's create the ff files outside of fftempdir")
#> - let's create the ff files outside of fftempdir
message("- let's reduce required disk space and thus file.system cache RAM")
#> - let's reduce required disk space and thus file.system cache RAM
message("By default we had record size 36.25")
#> By default we had record size 36.25
ffy <- read.csv.ffdf(
file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, asffdf_args=list(
vmode = c(
log="boolean"
, int="byte"
, dbl="single"
, fac="nibble" # no NAs
, ord="nibble" # no NAs
, dct="single"
, dat="single"
)
, col_args=list(pattern = "./csv") # create in getwd() with prefix csv
)
)
vmode(ffy)
#> log int dbl fac ord dct dat
#> "boolean" "byte" "single" "nibble" "nibble" "single" "single"
message("This record size is reduced by more than 50%")
#> This record size is reduced by more than 50%
sum(.ffbytes[vmode(ffy)]) / 36.25
#> [1] 0.3896552
message("Don't forget to wrap-up files that are not in fftempdir")
#> Don't forget to wrap-up files that are not in fftempdir
delete(ffy); rm(ffy)
#> [1] TRUE
message("It's a good habit to also wrap-up temporary stuff (or at least know how this is done)")
#> It's a good habit to also wrap-up temporary stuff (or at least know how this is done)
rm(ffx); gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 1173157 62.7 1994352 106.6 1994352 106.6
#> Vcells 2190177 16.8 8790397 67.1 8790397 67.1
fwffile <- tempfile()
cat(file=fwffile, "123456", "987654", sep="\n")
x <- read.fwf(fwffile, widths=c(1,2,3), stringsAsFactors = TRUE) #> 1 23 456 \ 9 87 654
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=c(1,2,3))
stopifnot(identical(x, y[,]))
x <- read.fwf(fwffile, widths=c(1,-2,3), stringsAsFactors = TRUE) #> 1 456 \ 9 654
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=c(1,-2,3))
stopifnot(identical(x, y[,]))
unlink(fwffile)
cat(file=fwffile, "123", "987654", sep="\n")
x <- read.fwf(fwffile, widths=c(1,0, 2,3), stringsAsFactors = TRUE) #> 1 NA 23 NA \ 9 NA 87 654
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=c(1,0, 2,3))
stopifnot(identical(x, y[,]))
unlink(fwffile)
cat(file=fwffile, "123456", "987654", sep="\n")
x <- read.fwf(fwffile, widths=list(c(1,0, 2,3), c(2,2,2))
, stringsAsFactors = TRUE) #> 1 NA 23 456 98 76 54
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=list(c(1,0, 2,3), c(2,2,2)))
stopifnot(identical(x, y[,]))
unlink(fwffile)
#> Warning: unknown factor values mapped to NA
unlink(csvfile)