Update a Data Frame or Cleanup a Data Frame after Importing
cleanup.import will correct errors and shrink
the size of data frames. By default, double precision numeric
variables are changed to integer when they contain no fractional components.
Infinite values or values greater than 1e20 in absolute value are set
to NA. This solves problems of importing Excel spreadsheets that
contain occasional character values for numeric columns, as S
converts these to Inf without warning. There is also an option to
convert variable names to lower case and to add labels to variables.
The latter can be made easier by importing a CNTLOUT dataset created
by SAS PROC FORMAT and using the sasdict option as shown in the
example below. cleanup.import can also transform character or
factor variables to dates.
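The two default numeric cleanups described above can be sketched in base R. This is only an illustration under stated assumptions, not the Hmisc implementation: shrink_numeric is a hypothetical helper, and the guard against values outside the integer range is an added assumption.

```r
# Hypothetical helper (not part of Hmisc): mimics two default cleanups,
# setting infinite or huge values to NA and storing whole-number doubles
# as integers.
shrink_numeric <- function(x, big = 1e20) {
  if (!is.numeric(x)) return(x)
  # Infinite values, or values larger than `big` in absolute value, become NA
  x[is.infinite(x) | (!is.na(x) & abs(x) > big)] <- NA
  nonmiss <- x[!is.na(x)]
  # Convert to integer storage only when no fractional parts are present
  # (assumption: also require the values to fit in R's integer range)
  if (is.double(x) &&
      all(nonmiss == floor(nonmiss)) &&
      all(abs(nonmiss) <= .Machine$integer.max)) {
    storage.mode(x) <- "integer"
  }
  x
}

d <- data.frame(a = c(1, 2, 3), b = c(1.5, Inf, 2), c = c(4, 1e25, 6))
d[] <- lapply(d, shrink_numeric)
sapply(d, typeof)
```

In this sketch a and c end up stored as integer (with the 1e25 entry set to NA), while b stays double because it contains a fractional value.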
upData is a function facilitating the updating of a data frame
without attaching it in search position one. New variables can be
added, old variables can be modified, variables can be removed or renamed, and
"labels" and "units" attributes can be provided.
Observations can be subsetted. Various checks
are made for errors and inconsistencies, with warnings issued to help
the user. Levels of factor variables can be replaced, especially
using the list notation of the standard merge.levels
function. Unless force.single is set to FALSE,
upData also converts double precision vectors to integer if no
fractional values are present in
a vector. upData is also used to process R workspace objects
created by StatTransfer, which puts variable and value labels as attributes on
the data frame rather than on each variable. If such attributes are
present, they are used to define all the labels and value labels
(through conversion to factor variables) before any label changes
take place, and force.single is set to a default of
FALSE, as StatTransfer already does conversion to integer.
Variables having labels but not classed "labelled" (e.g., data
imported using the haven package) have that class added to them
by upData.
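The "label" and "units" attributes mentioned above are ordinary R attributes. A minimal base R sketch of adding the "labelled" class to a variable that has a label but lacks the class might look as follows. add_labelled_class is a hypothetical helper used only for illustration, and upData's own logic may differ.

```r
# Hypothetical helper (not Hmisc code): add class "labelled" to a vector
# that carries a "label" attribute but does not yet have that class.
add_labelled_class <- function(x) {
  if (!is.null(attr(x, "label")) && !inherits(x, "labelled")) {
    class(x) <- c("labelled", class(x))
  }
  x
}

age <- c(31, 45, 27)
attr(age, "label") <- "Age"     # variable label
attr(age, "units") <- "years"   # units of measurement
age <- add_labelled_class(age)
class(age)
```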
The dataframeReduce function removes variables from a data frame
that are problematic for certain analyses. Variables can be removed
because the fraction of missing values exceeds a threshold, because they
are character or categorical variables having too many levels, or
because they are binary and have too small a prevalence in one of the
two values. Categorical variables can also have their levels combined
when a level is of low prevalence. A data frame listing actions taken
is returned as attribute "info" to the main returned data frame.
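Two of the removal rules above (too many missing values, too many levels) can be sketched in base R. drop_problem_vars is a hypothetical helper, not dataframeReduce itself. It leaves out the prevalence rule, the combining of levels, and the "info" attribute.

```r
# Hypothetical helper (not Hmisc code): drop variables whose fraction of
# missing values exceeds `fracmiss`, or whose number of distinct
# character/factor values exceeds `maxlevels`.
drop_problem_vars <- function(data, fracmiss = 1, maxlevels = NULL) {
  keep <- vapply(data, function(x) {
    if (mean(is.na(x)) > fracmiss) return(FALSE)
    if (!is.null(maxlevels) && (is.character(x) || is.factor(x)) &&
        length(unique(x[!is.na(x)])) > maxlevels) return(FALSE)
    TRUE
  }, logical(1))
  data[, keep, drop = FALSE]
}

d <- data.frame(id = 1:6,
                x  = c(1, NA, NA, NA, NA, 6),   # 4 of 6 values missing
                g  = letters[1:6],              # 6 distinct levels
                stringsAsFactors = FALSE)
reduced <- drop_problem_vars(d, fracmiss = 0.5, maxlevels = 3)
names(reduced)
```

Here x is dropped because two thirds of its values are missing, and g is dropped because it has more than three distinct values, so only id remains.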
Usage
cleanup.import(obj, labels, lowernames=FALSE,
force.single=TRUE, force.numeric=TRUE, rmnames=TRUE,
big=1e20, sasdict, print, datevars=NULL, datetimevars=NULL,
dateformat='%F',
fixdates=c('none','year'),
autodate=FALSE, autonum=FALSE, fracnn=0.3,
considerNA=NULL, charfactor=FALSE)
upData(object, ...,
subset, rename, drop, keep, labels, units, levels, force.single=TRUE,
lowernames=FALSE, caplabels=FALSE, classlab=FALSE, moveUnits=FALSE,
charfactor=FALSE, print=TRUE, html=FALSE)
dataframeReduce(data, fracmiss=1, maxlevels=NULL, minprev=0, print=TRUE)
Arguments
- obj
a data frame or list
- object
a data frame or list
- data
a data frame
- force.single
By default, double precision variables are converted to single precision (in S-Plus only) unless force.single=FALSE. force.single=TRUE will also convert vectors having only integer values to have a storage mode of integer, in R or S-Plus.
- force.numeric
Sometimes importing will cause a numeric variable to be changed to a factor vector. By default, cleanup.import will check each factor variable to see if the levels contain only numeric values and "". In that case, the variable will be converted to numeric, with "" converted to NA. Set force.numeric=FALSE to prevent this behavior.
- rmnames
set to FALSE to prevent cleanup.import from removing names or .Names attributes from variables
- labels
a character vector the same length as the number of variables in obj. These character values are taken to be variable labels in the same order as the variables in obj. For upData, labels is a named list or named vector with variables in no specific order.
- lowernames
set this to TRUE to change variable names to lower case. upData does this before applying any other changes, so variable names given inside arguments to upData need to be lower case if lowernames==TRUE.
- big
a value such that values larger than this in absolute value are set to missing by cleanup.import
- sasdict
the name of a data frame containing a raw imported SAS PROC CONTENTS CNTLOUT= dataset. This is used to define variable names and to add attributes to the new data frame specifying the original SAS dataset name and label.
- print
set to TRUE or FALSE to force or prevent printing of the current variable number being processed. By default, such messages are printed if the product of the number of variables and number of observations in obj exceeds 500,000. For dataframeReduce, set print to FALSE to suppress printing information about dropped or modified variables. The same applies to upData.
- datevars
character vector of names (after lowernames is applied) of variables to consider as a factor or character vector containing dates in a format matching dateformat. The default is "%F" which uses the yyyy-mm-dd format.
- datetimevars
character vector of names (after lowernames is applied) of variables to consider to be date-time variables, with date formats as described under datevars followed by a space followed by time in hh:mm:ss format. chron is used to store date-time variables. If all times in the variable are 00:00:00 the variable will be converted to an ordinary date variable.
- dateformat
for cleanup.import is the input format (see strptime)
- fixdates
for any of the variables listed in datevars that have a dateformat that cleanup.import understands, specifying fixdates allows correction of certain formatting inconsistencies before an attempt is made to convert the fields to dates (the default is to assume that the dateformat is followed for all observations for datevars). Currently fixdates='year' is implemented, which will cause 2-digit or 4-digit years to be shifted to the alternate number of digits when dateformat is the default "%F" or is "%y-%m-%d", "%m/%d/%y", or "%m/%d/%Y". Two-digit years are padded with 20 on the left. Set dateformat to the desired format, not the exceptional format.
- autodate
set to TRUE to have cleanup.import determine and automatically handle factor or character vectors that mainly contain dates of the form YYYY-mm-dd, mm/dd/YYYY, YYYY, or mm/YYYY, where the latter two are imputed to, respectively, July 3 and the 15th of the month. Takes effect when the fraction of non-dates (of non-missing values) is less than fracnn to allow for some free text such as "unknown". Attributes special.miss and imputed are created for the vector so that describe() will inform the user. Illegal values are converted to NAs and stored in the special.miss attribute.
- autonum
set to TRUE to have cleanup.import examine (after autodate) character and factor variables to see if they are legal numerics except for at most a fraction fracnn of the non-missing values being non-numeric. Qualifying variables are converted to numeric, and illegal values are set to NA and stored in the special.miss attribute to enhance describe output.
- fracnn
see autodate and autonum
- considerNA
for autodate and autonum, considers character values in the vector considerNA to be the same as NA. Leading and trailing white space and upper/lower case are ignored.
- charfactor
set to TRUE to change character variables to factors if they have fewer than n/2 unique values. Null strings and blanks are converted to NAs.
- ...
for upData, one or more expressions of the form variable=expression, to derive new variables or change old ones.
- subset
an expression that evaluates to a logical vector specifying which rows of object should be retained. The expressions should use the original variable names, i.e., before any variables are renamed but after lowernames takes effect.
- rename
list or named vector specifying old and new names for variables. Variables are renamed before any other operations are done. For example, to rename variables age and sex to respectively Age and gender, specify rename=list(age="Age", sex="gender") or rename=c(age=...).
- drop
a vector of variable names to remove from the data frame
- keep
a vector of variable names to keep, with all other variables dropped
- units
a named vector or list defining "units" attributes of variables, in no specific order
- levels
a named list defining "levels" attributes for factor variables, in no specific order. The values in this list may be character vectors redefining levels (in order) or another list (see merge.levels if using S-Plus).
- caplabels
set to TRUE to capitalize the first letter of each word in each variable label
- classlab
set to TRUE (the old default behavior) to automatically have upData make variables having a "label" attribute have class of "labelled". Note that when the labels argument to upData is given, these create labelled-class variables as always.
- moveUnits
set to TRUE to look for units of measurements in variable labels and move them to a "units" attribute. If an expression in a label is enclosed in parentheses or brackets it is assumed to be units if moveUnits=TRUE.
- html
set to TRUE to print conversion information as html verbatim at 0.6 size. The user will need to put results='asis' in a knitr chunk header to properly render this output.
- fracmiss
the maximum permissible proportion of NAs for a variable to be kept. Default is to keep all variables no matter how many NAs are present.
- maxlevels
the maximum number of levels of a character or categorical or factor variable before the variable is dropped
- minprev
the minimum proportion of non-missing observations in a category for a binary variable to be retained, and the minimum relative frequency of a category before it will be combined with other small categories
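The role of fracnn for autonum can be illustrated with a rough base R sketch. looks_numeric is a hypothetical helper, and two details are assumptions rather than documented behavior: empty strings are treated as missing, and the comparison with fracnn is strict.

```r
# Hypothetical helper (not Hmisc code): decide whether a character or factor
# vector is "mostly numeric", i.e. the fraction of non-missing values that
# fail numeric conversion is below `fracnn`.
looks_numeric <- function(x, fracnn = 0.3) {
  x <- trimws(as.character(x))
  x <- x[!is.na(x) & x != ""]          # assumption: "" counts as missing
  if (!length(x)) return(FALSE)
  bad <- is.na(suppressWarnings(as.numeric(x)))
  mean(bad) < fracnn                   # assumption: strict inequality
}

mostly  <- looks_numeric(c("1", "2.5", "unknown", "4", "7"))  # 1 of 5 is non-numeric
seldom  <- looks_numeric(c("a", "b", "3"))                    # 2 of 3 are non-numeric
c(mostly, seldom)
```

With the default fracnn=0.3, the first vector would qualify for conversion (with "unknown" becoming NA) and the second would not.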
Examples
if (FALSE) { # \dontrun{
dat <- read.table('myfile.asc')
dat <- cleanup.import(dat)
} # }
dat <- data.frame(a=1:3, d=c('01/02/2004',' 1/3/04',''))
cleanup.import(dat, datevars='d', dateformat='%m/%d/%y', fixdates='year')
#> a d
#> 1 1 2004-01-02
#> 2 2 2004-01-03
#> 3 3 <NA>
dat <- data.frame(a=(1:3)/7, y=c('a','b1','b2'), z=1:3)
dat2 <- upData(dat, x=x^2, x=x-5, m=x/10,
rename=c(a='x'), drop='z',
labels=c(x='X', y='test'),
levels=list(y=list(a='a',b=c('b1','b2'))))
#> Input object size: 1232 bytes; 3 variables 3 observations
#> Renamed variable a to x
#> Modified variable x
#> Modified variable x
#> Added variable m
#> Dropped variable z
#> New object size: 2392 bytes; 3 variables 3 observations
dat2
#> x y m
#> 1 -4.98 a -0.498
#> 2 -4.92 b -0.492
#> 3 -4.82 b -0.482
describe(dat2)
#> dat2
#>
#> 3 Variables 3 Observations
#> --------------------------------------------------------------------------------
#> x : X
#> n missing distinct Info Mean pMedian Gmd
#> 3 0 3 1 -4.905 -4.908 0.1088
#>
#> Value -4.98 -4.92 -4.82
#> Frequency 1 1 1
#> Proportion 0.333 0.333 0.333
#> --------------------------------------------------------------------------------
#> y : test
#> n missing distinct
#> 3 0 2
#>
#> Value a b
#> Frequency 1 2
#> Proportion 0.333 0.667
#> --------------------------------------------------------------------------------
#> m
#> n missing distinct Info Mean pMedian Gmd
#> 3 0 3 1 -0.4905 -0.4908 0.01088
#>
#> Value -0.498 -0.492 -0.482
#> Frequency 1 1 1
#> Proportion 0.333 0.333 0.333
#> --------------------------------------------------------------------------------
dat <- dat2 # copy to original name and delete dat2 if OK
rm(dat2)
dat3 <- upData(dat, X=X^2, subset = x < (3/7)^2 - 5, rename=c(x='X'))
#> Input object size: 2392 bytes; 3 variables 3 observations
#> Renamed variable x to X
#> Modified variable X
#> New object size: 2352 bytes; 3 variables 2 observations
# Remove hard to analyze variables from a redundancy analysis of all
# variables in the data frame
d <- dataframeReduce(dat, fracmiss=.1, minprev=.05, maxlevels=5)
# Could run redun(~., data=d) at this point or include dataframeReduce
# arguments in the call to redun
# If you import a SAS dataset created by PROC CONTENTS CNTLOUT=x.datadict,
# the LABELs from this dataset can be added to the data. Let's also
# convert names to lower case for the main data file
if (FALSE) { # \dontrun{
mydata2 <- cleanup.import(mydata2, lowernames=TRUE, sasdict=datadict)
} # }