mergeCheck.Rd
This is a first effort. It works with 2 data frames and 1 key variable in each. It does not yet work if the by parameter names more than one column, though that may be supported in the future. The return value is a list that includes full copies of the rows from the data frames in which trouble is observed.
mergeCheck(
  x,
  y,
  by,
  by.x = by,
  by.y = by,
  incomparables = c(NULL, NA, NaN, Inf, "\\s+", "")
)
x: Data frame.
y: Data frame.
by: Commonly called the "key" variable. A column name to be used for merging (common to both x and y).
by.x: Column name in x to be used for merging. If not supplied, by.x is assumed to be the same as by.
by.y: Column name in y to be used for merging. If not supplied, by.y is assumed to be the same as by.
incomparables: Values in the key (by) variable that are ignored for matching. By default these values are treated as incomparable: c(NULL, NA, NaN, Inf, "\\s+", ""). Note that this is a larger set of incomparables than R's merge assumes (its default is NULL only).
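To see why a wider set of incomparables can matter, here is a small base R sketch. It does not call mergeCheck, and the data frames a and b are made up for illustration. It shows that merge() with its default incomparables = NULL pairs a missing key in one data frame with a missing key in the other, while declaring NA incomparable prevents that match.

```r
# Hypothetical data: each data frame has exactly one row with a missing key.
a <- data.frame(id = c(1, NA), left = c("a1", "a2"))
b <- data.frame(id = c(NA, 2), right = c("b1", "b2"))

# Default (incomparables = NULL): the NA key in a is matched to the NA key in b.
m_default <- merge(a, b, by = "id")
print(m_default)

# Declaring NA incomparable means rows with NA keys are never matched.
m_strict <- merge(a, b, by = "id", incomparables = NA)
print(m_strict)
```

A silent NA-to-NA match like the first one is the kind of problem that is easy to miss when the merged result is large.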
Value: A list of the data structures that are displayed for keys and data sets. The return is list(keysBad, keysDuped, unmatched). unmatched is itself a list with 2 elements: the unmatched cases from x and the unmatched cases from y.
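The three kinds of trouble reported in the return value can be illustrated with a rough base R sketch. This is not the package's implementation. The helper names is_bad_key and key_report are hypothetical, the inner layout of each list element (an x part and a y part) is an assumption, and treating whitespace-only keys as bad is a choice made for this sketch.

```r
# Assumption for this sketch: a key is "bad" if it is missing, or if its
# text form (after trimming whitespace) is "NaN", "Inf", or empty.
is_bad_key <- function(k) {
  is.na(k) | trimws(as.character(k)) %in% c("NaN", "Inf", "")
}

# Hypothetical helper: collect full rows for each kind of key problem.
key_report <- function(x, y, by) {
  kx <- x[[by]]
  ky <- y[[by]]
  bad_x <- is_bad_key(kx)
  bad_y <- is_bad_key(ky)
  # Flag every copy of a repeated key, not only the second and later ones.
  dup_x <- kx %in% kx[duplicated(kx)]
  dup_y <- ky %in% ky[duplicated(ky)]
  # A row is unmatched if its key is bad or has no partner in the other data.
  un_x <- bad_x | !(kx %in% ky)
  un_y <- bad_y | !(ky %in% kx)
  list(
    keysBad   = list(x = x[bad_x, , drop = FALSE], y = y[bad_y, , drop = FALSE]),
    keysDuped = list(x = x[dup_x, , drop = FALSE], y = y[dup_y, , drop = FALSE]),
    unmatched = list(x = x[un_x, , drop = FALSE],  y = y[un_y, , drop = FALSE])
  )
}

# Made-up data: d1 has a duplicated key (2) and a missing key.
d1 <- data.frame(id = c(1, 2, 2, NA), v = 1:4)
d2 <- data.frame(id = c(2, 3), w = 1:2)
report <- key_report(d1, d2, by = "id")
print(report$keysDuped$x)
print(report$unmatched)
```

Returning full rows rather than just key values mirrors the description above: the caller can inspect the offending records directly.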
df1 <- data.frame(id = 1:7, x = rnorm(7))
df2 <- data.frame(id = c(2:6, 9:10), x = rnorm(7))
mc1 <- mergeCheck(df1, df2, by = "id")
#> Merge difficulties detected
#>
#> Unmatched cases from df1 and df2 :
#> df1
#>   id          x
#> 1  1 -0.2603823
#> 7  7 -1.3534595
#> df2
#>   id         x
#> 6  9 -1.889317
#> 7 10 -1.466877
## Use mc1 objects mc1$keysBad, mc1$keysDuped, mc1$unmatched
df1 <- data.frame(id = c(1:3, NA, NaN, "", " "), x = rnorm(7))
df2 <- data.frame(id = c(2:6, 5:6), x = rnorm(7))
mergeCheck(df1, df2, by = "id")
#> Merge difficulties detected
#>
#> Unacceptable key values
#> df1
#>     id          x
#> 4 <NA> -1.0696909
#> 5  NaN  1.8598919
#> 6      -0.9122561
#> Duplicated key values
#> df2
#>   id          x
#> 4  5 -0.6919614
#> 5  6  1.7610419
#> 6  5  1.0528643
#> 7  6  0.9041829
#> Unmatched cases from df1 and df2 :
#> df1
#>     id          x
#> 1    1 -1.1282967
#> 4 <NA> -1.0696909
#> 5  NaN  1.8598919
#> 6      -0.9122561
#> 7       0.3457086
#> df2
#>   id          x
#> 3  4  1.1191217
#> 4  5 -0.6919614
#> 5  6  1.7610419
#> 6  5  1.0528643
#> 7  6  0.9041829
df1 <- data.frame(idx = c(1:5, NA, NaN), x = rnorm(7))
df2 <- data.frame(idy = c(2:6, 9:10), x = rnorm(7))
mergeCheck(df1, df2, by.x = "idx", by.y = "idy")
#> Merge difficulties detected
#>
#> Unacceptable key values
#> df1
#>   idx         x
#> 6  NA -1.976887
#> 7 NaN  1.071800
#> Unmatched cases from df1 and df2 :
#> df1
#>   idx          x
#> 1   1 -0.9051318
#> 6  NA -1.9768874
#> 7 NaN  1.0717997
#> df2
#>   idy          x
#> 5   6  0.4886346
#> 6   9 -0.7481265
#> 7  10  1.0207891
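For context on the by.x and by.y case, the following base R sketch shows what merge() itself does once the key columns are clean but differently named. It does not call mergeCheck; the data frames left and right and their values are made up for illustration.

```r
# Hypothetical clean data with differently named key columns.
left  <- data.frame(idx = 1:5, x = c(0.1, 0.2, 0.3, 0.4, 0.5))
right <- data.frame(idy = c(2:6, 9:10), x = c(1, 2, 3, 4, 5, 6, 7))

# Inner merge: only keys present in both data frames survive (2, 3, 4, 5).
inner <- merge(left, right, by.x = "idx", by.y = "idy")

# Outer merge: unmatched rows from both sides are kept and padded with NA.
outer <- merge(left, right, by.x = "idx", by.y = "idy", all = TRUE)

# The key column takes its name from by.x; the clashing non-key column x
# is disambiguated with merge()'s default suffixes.
print(names(outer))
print(outer)
```

Comparing the row counts of the inner and outer results is a quick way to see how many cases a default merge would drop.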