Create variable key template (in memory or in a file)

A variable key is a human-readable document that describes the variables in a data set. A key can be revised and re-imported by R to recode data. This might also be referred to as a "programmable codebook." This function inspects a data frame, records its variable names, their classes, and their legal values, and then creates a table summarizing that information. The aim is to create a document that principal investigators and research assistants can use to keep a project well organized. Please see the vignette in this package.

keyTemplate(
  dframe,
  long = FALSE,
  sort = FALSE,
  file = NULL,
  max.levels = 15,
  missings = NULL,
  missSymbol = ".",
  safeNumericToInteger = TRUE,
  trimws = "both",
  varlab = FALSE
)

Arguments

dframe: A data frame
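long: Default FALSE. Should the key be created in the long format (one row per value per variable)? Otherwise, the wide format (one row per variable) is used. See Details.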
sort: Default FALSE. Should the rows representing the variables be sorted alphabetically? Otherwise, they appear in the order in which they were included in the original dataset.
file: Default NULL, meaning no file is produced. Choose a file name ending in either "csv" (for comma separated values), "xlsx" (compatible with Microsoft Excel), or "rds" (R serialization data). The file name extension is used to select among the 3 storage formats. XLSX output requires the openxlsx package.
max.levels: How high is the limit on the number of values for discrete (integer, character, and Date) variables? Default = 15. If observed number exceeds max.levels, we conclude the author should not assign new values in the key and only the missing value will be included in the key as a "placeholder". This does not affect variables declared as factor or ordered variables, for which all levels are included in all cases.
missings: Values in existing data which should be treated as missing in the new key. Character string in a format acceptable to the assignMissing function. Can be a string with several missing indicators, such as "1;2;3;(8,10);[22,24];> 99;< 2".
missSymbol: Default is the period, ".". A character string used to represent missing values in the key that is created. Relevant (mostly) for the key's value_new column. Because R's symbol NA can be mistaken for the character string "NA", we use a different (hopefully unmistakable) symbol in the key.
safeNumericToInteger: Default TRUE: Should we treat values which appear to be integers as integers? If a column is numeric, it might be safe to treat it as an integer. In many csv data sets, the values coded c(1, 2, 3) are really integers, not floats c(1.0, 2.0, 3.0). See safeInteger.
trimws: Which side of character values should have white space trimmed? Default is "both". The user can change this to "left" or "right", or set it to NULL to avoid any trimming.
varlab: A key can have a companion data structure for variable labels. Default is FALSE, but the value may also be TRUE or a named vector of variable labels, such as c("x1" = "happiness", "x2" = "wealth"). The labels become an attribute of the key object. See Details for information on storage of varlabs in saved key files.
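
To make the idea behind safeNumericToInteger concrete, here is a minimal base R sketch of an "is this numeric column really integer-valued?" check. The helper looks_like_integer and its tol argument are hypothetical names introduced for illustration; this is not the implementation of kutils::safeInteger, only one plausible way to express the same test.

```r
# Hypothetical helper, not part of kutils: TRUE when every non-missing
# value of a numeric vector is a whole number (within a small tolerance).
looks_like_integer <- function(x, tol = sqrt(.Machine$double.eps)) {
  is.numeric(x) && all(abs(x - round(x)) < tol, na.rm = TRUE)
}

x_whole <- c(1, 2, 3, NA)      # stored as double, but integer-valued
x_float <- c(1.5, 2, 3)        # contains a fractional value

looks_like_integer(x_whole)    # TRUE
looks_like_integer(x_float)    # FALSE

# Convert only when the check passes, so no information is lost
x_int <- if (looks_like_integer(x_whole)) as.integer(x_whole) else x_whole
class(x_int)                   # "integer"
```

Missing values are ignored by the check and survive the conversion as NA.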

Value

A key in the form of a data frame. The key is also saved on disk if the file argument is supplied. The key may have an attribute "varlab" that holds the variable labels.

Details

The variable key can be created in two formats, wide and long. The original style of the variable key, wide, has one row per variable. It has a style for compact notation about current values and required recodes. That is more compact, probably easier for experts to read, but perhaps more difficult to edit. The long style variable key has one row per value per variable. Thus, in a larger project, the long key can have many rows. However, in a larger project, the long style key is easier to edit with a spread sheet program.
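
The relationship between the two layouts can be illustrated with a toy example in base R. The data frame below is made up for illustration and uses only two of the key's columns (name_old and value_old); it is a sketch of the idea, not of how keyTemplate or wide2long are implemented.

```r
# Toy wide-style key: one row per variable, values joined by "|".
# The trailing "." stands for the missing value placeholder.
wide <- data.frame(name_old  = c("x6", "x2"),
                   value_old = c("1|2|3|.", "a|b|."),
                   stringsAsFactors = FALSE)

# Split each value_old string on the literal "|" character
vals <- strsplit(wide$value_old, "|", fixed = TRUE)

# Long-style layout: one row per value per variable
long <- data.frame(name_old  = rep(wide$name_old, lengths(vals)),
                   value_old = unlist(vals),
                   stringsAsFactors = FALSE)

nrow(wide)   # 2
nrow(long)   # 7
```

The wide form packs all values of a variable into one cell, while the long form gives each value its own row, which is what makes it convenient to edit in a spreadsheet.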

After a key is created, it should be re-imported into R with the kutils::keyImport function. Then the key structure can guide the importation and recoding of the data set.

Concerning the varlab attribute: run attr(key, "varlab") to review existing labels, if any.

Storing the variable labels in files requires some care because the rds, xlsx, and csv formats have different capabilities. The rds storage format saves all attributes without difficulty. In contrast, because csv and xlsx do not save attributes, the varlabs are stored as separate character matrices. For xlsx files, the varlab object is saved as a second sheet in the xlsx file, while for csv a second file suffixed "-varlab.csv" is created.
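
The difference between the formats can be seen with base R alone. The sketch below uses a made-up two-row data frame standing in for a key (it is not produced by keyTemplate) and shows that an attribute survives a round trip through rds but not through a plain csv file, which is why a companion file is needed in the csv case.

```r
# Toy stand-in for a key, with a "varlab" attribute (hypothetical contents)
key <- data.frame(name_old = c("x1", "x2"), name_new = c("x1", "x2"),
                  stringsAsFactors = FALSE)
attr(key, "varlab") <- c(x1 = "happiness", x2 = "wealth")

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

# rds serialization keeps the attribute
saveRDS(key, rds_file)
key_rds <- readRDS(rds_file)
attr(key_rds, "varlab")

# a plain csv file holds only the table, so the attribute is lost
write.csv(key, csv_file, row.names = FALSE)
key_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
is.null(attr(key_csv, "varlab"))   # TRUE
```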

Author

Paul Johnson <pauljohn@ku.edu>

Examples

set.seed(234234)
N <- 200
mydf <- data.frame(x5 = rnorm(N),
                   x4 = rpois(N, lambda = 3),
                   x3 = ordered(sample(c("lo", "med", "hi"),
                                       size = N, replace = TRUE),
                                levels = c("med", "lo", "hi")),
                   x2 = letters[sample(c(1:4,6), N, replace = TRUE)],
                   x1 = factor(sample(c("cindy", "bobby", "marcia",
                                        "greg", "peter"), N,
                                      replace = TRUE)),
                   x7 = ordered(letters[sample(c(1:4,6), N, replace = TRUE)]),
                   x6 = sample(c(1:5), N, replace = TRUE),
                   stringsAsFactors = FALSE)
mydf$x4[sample(1:N, 10)] <- 999
mydf$x5[sample(1:N, 10)] <- -999

## Note: If we change this example data, we need to save a copy in
## "../inst/extdata" for packaging
dn <- tempdir()
write.csv(mydf, file = file.path(dn, "mydf.csv"), row.names = FALSE)
mydf.templ <- keyTemplate(mydf, file = file.path(dn, "mydf.templ.csv"),
                          varlab = TRUE)
mydf.templ_long <- keyTemplate(mydf, long = TRUE,
                            file = file.path(dn, "mydf.templlong.csv"),
                            varlab = TRUE)

mydf.templx <- keyTemplate(mydf, file = file.path(dn, "mydf.templ.xlsx"),
                            varlab = TRUE)
mydf.templ_longx <- keyTemplate(mydf, long = TRUE,
                             file = file.path(dn, "mydf.templ_long.xlsx"),
                             varlab = TRUE)
## Check the varlab attribute
attr(mydf.templ, "varlab")
#>   x5   x4   x3   x2   x1   x7   x6 
#> "x5" "x4" "x3" "x2" "x1" "x7" "x6" 
mydf.tmpl2 <- keyTemplate(mydf,
                         varlab = c(x5 = "height", x4 = "age",
                         x3 = "intelligence", x1 = "Name"))
## Check the varlab attribute
attr(mydf.tmpl2, "varlab")
#>             x5             x4             x3             x1 
#>       "height"          "age" "intelligence"         "Name" 

## Try with the national longitudinal study data
data(natlongsurv)
natlong.templ <- keyTemplate(natlongsurv,
                          file = file.path(dn, "natlongsurv.templ.csv"),
                          max.levels = 15, varlab = TRUE, sort = TRUE)

natlong.templlong <- keyTemplate(natlongsurv, long = TRUE,
                   file = file.path(dn, "natlongsurv.templ_long.csv"), sort = TRUE)
if(interactive()) View(natlong.templlong)
natlong.templlong2 <- keyTemplate(natlongsurv, long = TRUE,
                      missings = "<0", max.levels = 50, sort = TRUE,
                      varlab = TRUE)
if(interactive()) View(natlong.templlong2)

natlong.templwide2 <- keyTemplate(natlongsurv, long = FALSE,
                      missings = "<0", max.levels = 50, sort = TRUE)
if(interactive()) View(natlong.templwide2)

all.equal(wide2long(natlong.templwide2), natlong.templlong2)
#> [1] TRUE

head(keyTemplate(natlongsurv, file = file.path(dn, "natlongsurv.templ.xlsx"),
             max.levels = 15, varlab = TRUE, sort = TRUE), 10)
#>          name_old name_new class_old class_new
#> R0000100 R0000100 R0000100   integer   integer
#> R0003300 R0003300 R0003300   integer   integer
#> R0005700 R0005700 R0005700   integer   integer
#> R0060300 R0060300 R0060300   integer   integer
#> R1051600 R1051600 R1051600   integer   integer
#> R1302000 R1302000 R1302000   integer   integer
#> R1302100 R1302100 R1302100   integer   integer
#> R1303400 R1303400 R1303400   integer   integer
#> R6235600 R6235600 R6235600   integer   integer
#> R6502300 R6502300 R6502300   integer   integer
#>                                     value_old
#> R0000100                                    .
#> R0003300                        1|2|3|4|5|6|.
#> R0005700                                    .
#> R0060300                                    .
#> R1051600 -5|-4|9|10|11|12|13|14|15|16|17|18|.
#> R1302000                          -5|-4|0|1|.
#> R1302100                                    .
#> R1303400                                    .
#> R6235600                                    .
#> R6502300                          -4|1|3|14|.
#>                                     value_new missings recodes
#> R0000100                                    .                 
#> R0003300                        1|2|3|4|5|6|.                 
#> R0005700                                    .                 
#> R0060300                                    .                 
#> R1051600 -5|-4|9|10|11|12|13|14|15|16|17|18|.                 
#> R1302000                          -5|-4|0|1|.                 
#> R1302100                                    .                 
#> R1303400                                    .                 
#> R6235600                                    .                 
#> R6502300                          -4|1|3|14|.                 
head(keyTemplate(natlongsurv, file = file.path(dn, "natlongsurv.templ.xlsx"),
             long = TRUE, max.levels = 15, varlab = TRUE, sort = TRUE), 10)
#>    name_old name_new class_old class_new value_old value_new missings recodes
#> 1  R0000100 R0000100   integer   integer         .         .                 
#> 2  R0003300 R0003300   integer   integer         1         1                 
#> 3  R0003300 R0003300   integer   integer         2         2                 
#> 4  R0003300 R0003300   integer   integer         3         3                 
#> 5  R0003300 R0003300   integer   integer         4         4                 
#> 6  R0003300 R0003300   integer   integer         5         5                 
#> 7  R0003300 R0003300   integer   integer         6         6                 
#> 8  R0003300 R0003300   integer   integer         .         .                 
#> 9  R0005700 R0005700   integer   integer         .         .                 
#> 10 R0060300 R0060300   integer   integer         .         .                 

list.files(dn)
#>  [1] "downlit"                      "mydf.csv"                    
#>  [3] "mydf.templ-varlab.csv"        "mydf.templ.csv"              
#>  [5] "mydf.templ.xlsx"              "mydf.templ_long.xlsx"        
#>  [7] "mydf.templlong-varlab.csv"    "mydf.templlong.csv"          
#>  [9] "natlongsurv.templ-varlab.csv" "natlongsurv.templ.csv"       
#> [11] "natlongsurv.templ.xlsx"       "natlongsurv.templ_long.csv"  
#> [13] "test-20251014-2024.6"         "test.1-20251014-2024.txt"    
#> [15] "test.2-20251014-2024.txt"     "test.3-20251014-2024.txt"    
#> [17] "test.4-20251014-2024.txt"     "test.5-20251014-2024.txt"    
#> [19] "test1"                        "test2"                       
#> [21] "test3"