
Reading data from worksheets
readWorksheet-methods.RdReads data from worksheets of a workbook.
Usage
# S4 method for class 'workbook,numeric'
readWorksheet(object,sheet,startRow,startCol,endRow,endCol,autofitRow,autofitCol,
region,header,rownames,colTypes,forceConversion,dateTimeFormat,check.names,
useCachedValues,keep,drop, simplify, readStrategy)
# S4 method for class 'workbook,character'
readWorksheet(object,sheet,startRow,startCol,endRow,endCol,autofitRow,autofitCol,
region,header,rownames,colTypes,forceConversion,dateTimeFormat,check.names,
useCachedValues,keep,drop, simplify, readStrategy)Arguments
- object
The
workbookto use- sheet
The name or index of the worksheet to read from
- startRow
The index of the first row to read from. Defaults to
0meaning that the start row is determined automatically.- startCol
The index of the first column to read from. Defaults to
0meaning that the start column is determined automatically.- endRow
The index of the last row to read from. Defaults to
0meaning that the end row is determined automatically.- endCol
The index of the last column to read from. Defaults to
0meaning that the end column is determined automatically.- autofitRow
logicalspecifying if leading and trailing empty rows should be skipped. Defaults toTRUE.- autofitCol
logicalspecifying if leading and trailing empty columns should be skipped. Defaults toTRUE.- region
A range specifier in the form 'A10:B18'. This provides an alternative way to specify
startRow,startCol,endRowandendCol. Range specifications take precedence over index specifications.- header
Interpret the first row of the specified area as column headers. The default is
TRUE.- rownames
Index (
numeric) or name (character) of column that should be used as row names. The corresponding column will be removed from the data set. Defaults toNULLwhich means that no row names are applied. Row names must be eitherintegerorcharacter. Non-numeric columns will be coerced tocharacter.- colTypes
Column types to use when reading in the data. Specified as a
charactervector of the corresponding type names (seeXLC;XLC$DATA_TYPE.<?>). You may also use R class names such asnumeric,character,logicalandPOSIXt. The types are applied in the given order to the columns - elements are recycled if necessary. Defaults tocharacter(0)meaning that column types are determined automatically (see the Note section for more information).
By default, type conversions are only applied if the specified column type is a more generic type (e.g. from Numeric to String) - otherwiseNAis returned. TheforceConversionflag can be set to force conversion into less generic types where possible.- forceConversion
logicalspecifying if conversions to less generic types should be forced. Defaults toFALSEmeaning that if a column is specified to be of a certain type via thecolTypesargument and a more generic type is detected in the column, thenNAwill be returned (example: column is specified to be DateTime but a more generic String is found). SpecifyingforceConversion = TRUEwill try to enforce a conversion - if it succeeds the corresponding (converted) value will be returned, otherwiseNA. See the Note section for some additional information.- dateTimeFormat
Date/time format used when doing date/time conversions. Defaults to
getOption("XLConnect.dateTimeFormat"). This should be a POSIX format specifier according tostrptimealthough not all specifications have been implemented yet - the most important ones however are available. When using the '%OS' specification for fractional seconds (without an additional integer) 3 digits will be used by default (getOption("digits.secs")is not considered).- check.names
logicalspecifying if column names of the resultingdata.frameshould be checked to ensure that they are syntactically valid variable names and are not duplicated. See thecheck.namesargument ofdata.frame. Defaults toTRUE.- useCachedValues
logicalspecifying whether to read cached formula results from the workbook instead of re-evaluating them. This is particularly helpful in cases for reading data produced by Excel features not supported in XLConnect like references to external workbooks. Defaults toFALSE, which means that formulas will be evaluated by XLConnect.- keep
Vector of column names or indices to be kept in the output data frame. It is possible to specify either
keepordrop, but not both at the same time. Defaults toNULL. If a vector is passed as argument, it will be wrapped into a list. This list gets replicated to match the length of the other arguments. Example: ifsheet = c("Sheet1", "Sheet2", "Sheet3")andkeep = c(1,2),keepwill be internally converted intolist(c(1,2))and then replicated to match the number of sheets, i.e.keep = list(c(1,2), c(1,2), c(1,2)). The result is that the first two columns of each sheet are kept. Ifkeep = list(1,2)is specified, it will be replicated aslist(1,2,1), i.e. respectively the first, second and first column of the sheets "Sheet1", "Sheet2", "Sheet3" will be kept.- drop
Vector of column names or indices to be dropped in the output data frame. It is possible to specify either
keepordrop, but not both at the same time. Defaults toNULL. If a vector is passed as argument, it will be wrapped into a list. This list gets replicated to match the length of the other arguments. Example: ifsheet = c("Sheet1", "Sheet2", "Sheet3")anddrop = c(1,2),dropwill be internally converted intolist(c(1,2))and then replicated to match the number of sheets, i.e.drop = list(c(1,2), c(1,2), c(1,2)). The result is that the first two columns of each sheet are dropped. Ifdrop = list(1,2)is specified, it will be replicated aslist(1,2,1), i.e. respectively the first, second and first column of the sheets "Sheet1", "Sheet2", "Sheet3" will be dropped.- simplify
logicalspecifying if the result should be simplified, e.g. in case thedata.framewould only have one row or one column (and data types match). Simplifying here is identical to callingunliston the otherwise resultingdata.frame(usinguse.names = FALSE). The default isFALSE.- readStrategy
characterspecifying the reading strategy to use. Currently supported strategies are:"default"(default): Can handle all supported data types incl. date/time values and can deal directly with missing value identifiers (seesetMissingValue)"fast": Increased read performance. Date/time values are read as numeric (number of days since 1900-01-01; fractional days represent hours, minutes, and seconds) and only blank cells are recognized as missing (missing value identifiers as set insetMissingValueare ignored)
Details
Reads data from the worksheet specified by sheet. Data is read
starting at the top left corner specified by startRow and
startCol down to the bottom right corner specified by
endRow and endCol. If header = TRUE, the first row
is interpreted as column names of the resulting data.frame.
If
startRow <= 0 then the first available row in the sheet is assumed.
If endRow = 0 then the last available row in the sheet is assumed.
For endRow = -n with n > 0, the 'last row' - n rows is assumed. This
is useful in cases where you want to skip the last n rows.
If startCol <= 0 then the minimum column between startRow and
endRow is assumed. If endCol = 0 then the maximum column between
startRow and endRow is assumed. If endCol = -n with n > 0,
the maximum column between startRow and endRow except for the last n
columns is assumed.
In other words, if no boundaries are specified readWorksheet assumes the "bounding box" of the data as the corresponding boundaries.
The arguments autofitRow and autofitCol (both defaulting to TRUE) can be used to skip leading and trailing empty rows even in case startRow, endRow, startCol and endCol are specified to values > 0. This can be useful if data is expected within certain given boundaries but the exact location is not available.
If all four coordinate arguments are missing this behaves as above with
startRow = 0, startCol = 0, endRow = 0 and
endCol = 0. In this case readWorksheet assumes the
"bounding box" of the data as the corresponding boundaries.
All arguments (except object) are vectorized. As such, multiple worksheets (and also multiple data regions
from the same worksheet) can be read with one method call. If only one single data region is read, the return value
is a data.frame. If multiple data regions are specified, the return value is a list of data.frame's
returned in the order they have been specified. If worksheets have been specified by name, the list will be a
named list named by the corresponding worksheets.
Author
Martin Studer
Thomas Themel
Nicola Lambiase
Mirai Solutions GmbH https://mirai-solutions.ch
Note
If no specific column types (see argument colTypes) are specified,
readWorksheet tries to determine the resulting column types based on the read cell types. If different cell
types are found in a specific column, the most general of those is used and mapped to the corresponding R data type.
The order of data types from least to most general is Boolean (logical) < DateTime (POSIXct) <
Numeric (numeric) < String (character). E.g. if a column is read that contains cells of type Boolean,
Numeric and String then the resulting column in R would be character since character is the most general
type.
Some additional information with respect to forcing data type conversion
using forceConversion = TRUE:
Forcing conversion from String to Boolean:
TRUEis returned if and only if the target string is "true" (ignoring any capitalization). Any other string will returnFALSE.Forcing conversion from Numeric to DateTime: since Excel understands Dates/Times as Numerics with some additional formatting, a conversion from a Numeric to a DateTime is actually possible. Numerics in this case represent the number of days since 1900-01-00 (yes, day 00! - see https://web.archive.org/web/20240821110422/http://www.cpearson.com/excel/datetime.htm). Note that in R 0 is represented as 1899-12-31 since there is no 1900-01-00. Fractional days represent hours, minutes, and seconds.
Examples
if (FALSE) { # \dontrun{
## Example 1:
# mtcars xlsx file from demoFiles subfolder of package XLConnect
demoExcelFile <- system.file("demoFiles/mtcars.xlsx", package = "XLConnect")
# Load workbook
wb <- loadWorkbook(demoExcelFile)
# Read worksheet 'mtcars' (providing no specific area bounds;
# with default header = TRUE)
data <- readWorksheet(wb, sheet = "mtcars")
## Example 2:
# mtcars xlsx file from demoFiles subfolder of package XLConnect
demoExcelFile <- system.file("demoFiles/mtcars.xlsx", package = "XLConnect")
# Load workbook
wb <- loadWorkbook(demoExcelFile)
# Read worksheet 'mtcars' (providing area bounds; with default header = TRUE)
data <- readWorksheet(wb, sheet = "mtcars", startRow = 1, startCol = 3,
endRow = 15, endCol = 8)
## Example 3:
# mtcars xlsx file from demoFiles subfolder of package XLConnect
demoExcelFile <- system.file("demoFiles/mtcars.xlsx", package = "XLConnect")
# Load workbook
wb <- loadWorkbook(demoExcelFile)
# Read worksheet 'mtcars' (providing area bounds using the region argument;
# with default header = TRUE)
data <- readWorksheet(wb, sheet = "mtcars", region = "C1:H15")
## Example 4:
# conversion xlsx file from demoFiles subfolder of package XLConnect
excelFile <- system.file("demoFiles/conversion.xlsx", package = "XLConnect")
# Load workbook
wb <- loadWorkbook(excelFile)
# Read worksheet 'Conversion' with pre-specified column types
# Note: in the worksheet all data was entered as strings!
# forceConversion = TRUE is used to force conversion from String
# into the less generic data types Numeric, DateTime & Boolean
df <- readWorksheet(wb, sheet = "Conversion", header = TRUE,
colTypes = c(XLC$DATA_TYPE.NUMERIC,
XLC$DATA_TYPE.DATETIME,
XLC$DATA_TYPE.BOOLEAN),
forceConversion = TRUE,
dateTimeFormat = "%Y-%m-%d %H:%M:%S")
## Example 5:
# mtcars xlsx file from demoFiles subfolder of package XLConnect
demoExcelFile <- system.file("demoFiles/mtcars.xlsx", package = "XLConnect")
# Load workbook
wb <- loadWorkbook(demoExcelFile)
# Read the columns 1, 3 and 5 from the sheet 'mtcars' (with default header = TRUE)
data <- readWorksheet(wb, sheet = "mtcars", keep=c(1,3,5))
} # }