First draft of function to diagnose problems in merges and key variables

This is a first effort. It works with two data frames and one key variable in each. It does not yet work if the by parameter names more than one column, although that may be supported in the future. The return value is a list that includes full copies of the rows from the data frames in which trouble is observed.
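
The kinds of single-key checks described above can be sketched in base R. This is only an illustration under stated assumptions: keyProblems and the names of its return elements are hypothetical, and the code is not the mergeCheck source.

```r
# Hypothetical helper (not the mergeCheck implementation): base-R checks
# for one key column in each of two data frames.
keyProblems <- function(x, y, by.x, by.y = by.x) {
    kx <- x[[by.x]]
    ky <- y[[by.y]]
    list(
        # full rows whose key value occurs more than once in the same data frame
        dupedX = x[kx %in% kx[duplicated(kx)], , drop = FALSE],
        dupedY = y[ky %in% ky[duplicated(ky)], , drop = FALSE],
        # full rows whose key has no partner in the other data frame
        unmatchedX = x[!(kx %in% ky), , drop = FALSE],
        unmatchedY = y[!(ky %in% kx), , drop = FALSE]
    )
}

df1 <- data.frame(id = 1:7, x = rnorm(7))
df2 <- data.frame(id = c(2:6, 5:6), x = rnorm(7))
kp <- keyProblems(df1, df2, by.x = "id")
kp$unmatchedX$id   # keys 1 and 7 have no partner in df2
kp$dupedY$id       # keys 5 and 6 each appear twice in df2
```

Returning whole rows rather than bare key values mirrors the description above, since the other columns are often what reveals why a key went wrong.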

mergeCheck(
  x,
  y,
  by,
  by.x = by,
  by.y = by,
  incomparables = c(NULL, NA, NaN, Inf, "\\s+", "")
)

Arguments

x: data frame
y: data frame
by: Commonly called the "key" variable. A single column name, common to both x and y, to be used for merging.
by.x: Column name in x to be used for merging. If not supplied, by.x is assumed to be the same as by.
by.y: Column name in y to be used for merging. If not supplied, by.y is assumed to be the same as by.
incomparables: values in the key (by) variable that are ignored for matching. By default, these values are treated as incomparables: c(NULL, NA, NaN, Inf, "\\s+", ""). Note that this is a larger set of incomparables than base R's merge uses (its default is incomparables = NULL).
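
To see why a broader set of incomparables can matter, the following base-R sketch shows that merge, with its default incomparables = NULL, pairs rows whose keys are both NA. The data frames a and b are made up for illustration.

```r
# Two small data frames whose key column contains a missing value.
a <- data.frame(id = c(1, NA), xa = c("a1", "a2"))
b <- data.frame(id = c(1, NA), xb = c("b1", "b2"))

# With the default incomparables = NULL, merge() pairs the NA keys,
# so the result has a row for id 1 and a row for id NA.
m <- merge(a, b, by = "id")
nrow(m)

# match() shows the mechanism: NA matches NA unless it is declared
# incomparable, in which case no match is reported.
match(NA, c(1, NA))
match(NA, c(1, NA), incomparables = NA)
```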

Value

A list of the data structures that are displayed for keys and data sets: list(keysBad, keysDuped, unmatched). The unmatched element is itself a list with 2 elements: the unmatched cases from x and the unmatched cases from y.
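
One possible use of the unmatched element is to set the reported rows aside before merging. The sketch below uses a hand-built stand-in for the return value, not real mergeCheck output. Only unmatched is mocked, and because the names of its two elements are not specified here, it is indexed by position.

```r
# Hand-built stand-in for part of a mergeCheck() return value. This is an
# assumption made for illustration; it was not produced by mergeCheck.
df1 <- data.frame(id = 1:4, x = c(0.1, 0.2, 0.3, 0.4))
df2 <- data.frame(id = c(2:4, 9L), x = c(1.1, 1.2, 1.3, 1.4))
mc <- list(unmatched = list(df1[1, ], df2[4, ]))

# First element: unmatched rows from x; second: unmatched rows from y.
dropX <- mc$unmatched[[1]]$id
dropY <- mc$unmatched[[2]]$id

# Remove the unmatched rows, then merge what remains.
clean1 <- df1[!(df1$id %in% dropX), ]
clean2 <- df2[!(df2$id %in% dropY), ]
merged <- merge(clean1, clean2, by = "id")
nrow(merged)
```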

Author

Paul Johnson

Examples

df1 <- data.frame(id = 1:7, x = rnorm(7))
df2 <- data.frame(id = c(2:6, 9:10), x = rnorm(7))
mc1 <- mergeCheck(df1, df2, by = "id")
#> Merge difficulties detected
#> 
#> Unmatched cases from df1 and df2 :
#> df1 
#>   id          x
#> 1  1 -0.2603823
#> 7  7 -1.3534595
#> df2 
#>   id         x
#> 6  9 -1.889317
#> 7 10 -1.466877
## The results are stored in mc1$keysBad, mc1$keysDuped, and mc1$unmatched
df1 <- data.frame(id = c(1:3, NA, NaN, "", " "), x = rnorm(7))
df2 <- data.frame(id = c(2:6, 5:6), x = rnorm(7))
mergeCheck(df1, df2, by = "id")
#> Merge difficulties detected
#> 
#> Unacceptable key values
#> df1 
#>     id          x
#> 4 <NA> -1.0696909
#> 5  NaN  1.8598919
#> 6      -0.9122561
#> Duplicated key values
#> df2 
#>   id          x
#> 4  5 -0.6919614
#> 5  6  1.7610419
#> 6  5  1.0528643
#> 7  6  0.9041829
#> Unmatched cases from df1 and df2 :
#> df1 
#>     id          x
#> 1    1 -1.1282967
#> 4 <NA> -1.0696909
#> 5  NaN  1.8598919
#> 6      -0.9122561
#> 7       0.3457086
#> df2 
#>   id          x
#> 3  4  1.1191217
#> 4  5 -0.6919614
#> 5  6  1.7610419
#> 6  5  1.0528643
#> 7  6  0.9041829
df1 <- data.frame(idx = c(1:5, NA, NaN), x = rnorm(7))
df2 <- data.frame(idy = c(2:6, 9:10), x = rnorm(7))
mergeCheck(df1, df2, by.x = "idx", by.y = "idy")
#> Merge difficulties detected
#> 
#> Unacceptable key values
#> df1 
#>   idx         x
#> 6  NA -1.976887
#> 7 NaN  1.071800
#> Unmatched cases from df1 and df2 :
#> df1 
#>   idx          x
#> 1   1 -0.9051318
#> 6  NA -1.9768874
#> 7 NaN  1.0717997
#> df2 
#>   idy          x
#> 5   6  0.4886346
#> 6   9 -0.7481265
#> 7  10  1.0207891