This is a first effort. It works with two data frames and one key variable in each. It does not yet work if the by parameter includes more than one column name (this may be supported in the future). The return value is a list that includes full copies of the rows from the data frames in which trouble is observed.
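Because only a single key column is handled, one possible workaround for a multi-column key is to combine the columns into one before checking. The sketch below uses only base R; the data frames, the column names (site, year, key), and the "_" separator are illustrative assumptions, not part of this function.

```r
# Illustrative data with a two-column key (site, year).
dfx <- data.frame(site = c("A", "A", "B"), year = c(2001, 2002, 2001), v = 1:3)
dfy <- data.frame(site = c("A", "B", "B"), year = c(2001, 2001, 2002), w = 4:6)

# Combine the key columns into a single column. The separator is an
# arbitrary choice and must not occur inside the key values themselves.
# Caution: paste() turns a missing value into the string "NA", so missing
# keys should be checked before the columns are combined.
dfx$key <- paste(dfx$site, dfx$year, sep = "_")
dfy$key <- paste(dfy$site, dfy$year, sep = "_")

# Combined keys that appear in one data frame but not the other.
setdiff(dfx$key, dfy$key)  # "A_2002"
setdiff(dfy$key, dfx$key)  # "B_2002"
```

The single key column could then be supplied as by = "key".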

mergeCheck(
  x,
  y,
  by,
  by.x = by,
  by.y = by,
  incomparables = c(NULL, NA, NaN, Inf, "\\s+", "")
)

Arguments

x

data frame

y

data frame

by

Commonly called the "key" variable. A column name to be used for merging (common to both x and y).

by.x

Column name in x to be used for merging. If not supplied, then by.x is assumed to be same as by.

by.y

Column name in y to be used for merging. If not supplied, then by.y is assumed to be same as by.

incomparables

Values in the key (by) variable that are ignored for matching. By default, these values are treated as incomparables: c(NULL, NA, NaN, Inf, "\\s+", ""). Note that this is a larger set of incomparables than R's merge assumes (its default is only NULL).
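To illustrate what the default incomparables are meant to catch, here is a minimal base R sketch. The helper flagBadKeys is hypothetical and is not how this function is implemented; the package's own matching may treat some values differently (whitespace-only strings, for example).

```r
# Hypothetical helper, not part of the package: flag key values that fall
# into the categories listed in the default incomparables.
flagBadKeys <- function(key) {
  keyChar <- as.character(key)
  is.na(key) |                                 # NA (and NaN in numeric keys)
    keyChar %in% c("NaN", "Inf", "-Inf") |     # NaN or infinite, also as text
    grepl("^\\s*$", keyChar)                   # empty or whitespace-only strings
}

# Mixing numbers with "" coerces the whole vector to character,
# so NaN arrives here as the string "NaN".
id <- c(1:3, NA, NaN, "", " ")
flagBadKeys(id)
# FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```

Rows flagged TRUE are the ones a reviewer would want to inspect before merging.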

Value

A list with three elements, list(keysBad, keysDuped, unmatched), holding the rows with unacceptable key values, the rows with duplicated key values, and the unmatched cases. unmatched is itself a list with 2 elements: the unmatched cases from x and the unmatched cases from y.
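The ideas behind keysDuped and unmatched can be sketched in base R as follows. This is only an approximation under stated assumptions: the helpers dupedRows and unmatchedRows, the element names x and y, and the example data are all hypothetical, and the actual return object may be organized differently.

```r
# Illustrative data; column names are arbitrary.
dfx <- data.frame(id = c(1, 2, 3, 3), v = c("a", "b", "c", "d"))
dfy <- data.frame(id = c(2, 3, 4), w = c(10, 20, 30))

# Hypothetical helper: all rows whose key value occurs more than once
# (every copy, not just the second and later occurrences).
dupedRows <- function(df, key) {
  k <- df[[key]]
  df[k %in% k[duplicated(k)], , drop = FALSE]
}

# Hypothetical helper: rows whose key has no partner among otherKeys.
unmatchedRows <- function(df, key, otherKeys) {
  df[!(df[[key]] %in% otherKeys), , drop = FALSE]
}

keysDuped <- list(x = dupedRows(dfx, "id"), y = dupedRows(dfy, "id"))
unmatched <- list(x = unmatchedRows(dfx, "id", dfy$id),
                  y = unmatchedRows(dfy, "id", dfx$id))

keysDuped$x   # the two rows of dfx with id 3
unmatched$x   # the row of dfx with id 1
unmatched$y   # the row of dfy with id 4
```

Returning full rows rather than bare key values makes it easier to see which records need correcting.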

Author

Paul Johnson

Examples

df1 <- data.frame(id = 1:7, x = rnorm(7))
df2 <- data.frame(id = c(2:6, 9:10), x = rnorm(7))
mc1 <- mergeCheck(df1, df2, by = "id")
#> Merge difficulties detected
#> 
#> Unmatched cases from df1 and df2 :
#> df1 
#>   id          x
#> 1  1 -0.2603823
#> 7  7 -1.3534595
#> df2 
#>   id         x
#> 6  9 -1.889317
#> 7 10 -1.466877
## Inspect the returned components: mc1$keysBad, mc1$keysDuped, mc1$unmatched
df1 <- data.frame(id = c(1:3, NA, NaN, "", " "), x = rnorm(7))
df2 <- data.frame(id = c(2:6, 5:6), x = rnorm(7))
mergeCheck(df1, df2, by = "id")
#> Merge difficulties detected
#> 
#> Unacceptable key values
#> df1 
#>     id          x
#> 4 <NA> -1.0696909
#> 5  NaN  1.8598919
#> 6      -0.9122561
#> Duplicated key values
#> df2 
#>   id          x
#> 4  5 -0.6919614
#> 5  6  1.7610419
#> 6  5  1.0528643
#> 7  6  0.9041829
#> Unmatched cases from df1 and df2 :
#> df1 
#>     id          x
#> 1    1 -1.1282967
#> 4 <NA> -1.0696909
#> 5  NaN  1.8598919
#> 6      -0.9122561
#> 7       0.3457086
#> df2 
#>   id          x
#> 3  4  1.1191217
#> 4  5 -0.6919614
#> 5  6  1.7610419
#> 6  5  1.0528643
#> 7  6  0.9041829
df1 <- data.frame(idx = c(1:5, NA, NaN), x = rnorm(7))
df2 <- data.frame(idy = c(2:6, 9:10), x = rnorm(7))
mergeCheck(df1, df2, by.x = "idx", by.y = "idy")
#> Merge difficulties detected
#> 
#> Unacceptable key values
#> df1 
#>   idx         x
#> 6  NA -1.976887
#> 7 NaN  1.071800
#> Unmatched cases from df1 and df2 :
#> df1 
#>   idx          x
#> 1   1 -0.9051318
#> 6  NA -1.9768874
#> 7 NaN  1.0717997
#> df2 
#>   idy          x
#> 5   6  0.4886346
#> 6   9 -0.7481265
#> 7  10  1.0207891