deduper.Rd
In Qualtrics data, column names sometimes contain repeated words, such as "Philadelphia_Philadelphia_3". This function changes a vector c("Philadelphia_Philadelphia_3", "Denver_Denver_4") to c("Philadelphia_3", "Denver_4"). It is non-destructive: values that do not contain such repeats are left unaltered.
deduper(x, sep = ",_\\s-", n = NULL)
x: Character vector.
sep: Delimiter. A regular expression indicating the point at which to split the strings before checking for duplicates. The default looks for repeats separated by a comma, underscore, single whitespace character, or hyphen.
n: Limit on the number of duplicates to remove. The default, NULL, deletes all duplicates at the beginning of a string.
Cleaned up vector.
See https://stackoverflow.com/questions/43711240/r-regular-expression-match-omit-several-repeats
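The behaviour described above can be approximated with a single back-referencing regular expression. The following is only a minimal sketch, not the package's actual source: the name `dedupe_sketch` is hypothetical, and it assumes that `sep` is placed inside a character class and that `n` caps the number of repeats removed, which is consistent with the examples on this page.

```r
# Hypothetical illustration; the real deduper() may be implemented differently.
dedupe_sketch <- function(x, sep = ",_\\s-", n = NULL) {
  # With n = NULL, strip every leading repeat; otherwise strip at most n.
  quantifier <- if (is.null(n)) "+" else paste0("{1,", n, "}")
  # ^(.*?)          shortest leading chunk that is followed by a repeat
  # (?:[sep]\\1)    one separator character, then the same chunk again
  pattern <- paste0("^(.*?)(?:[", sep, "]\\1)", quantifier)
  # Replace the chunk plus its repeats with a single copy of the chunk.
  # Strings with no leading repeat do not match and are returned as-is.
  sub(pattern, "\\1", x, perl = TRUE)
}

dedupe_sketch(c("Philadelphia_Philadelphia_3", "Denver_Denver_4"))
dedupe_sketch("Den_Den_Den_Den_Den_Den_Den_5", n = 2)
```

Under the same assumptions, column names of a data frame (here a hypothetical `dat`) could be cleaned with `names(dat) <- dedupe_sketch(names(dat))`.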
x <- c("Philadelphia_Philadelphia_3", "Denver_Denver_4",
"Den_Den_Den_Den_Den_Den_Den_5")
deduper(x)
#> [1] "Philadelphia_3" "Denver_4" "Den_5"
deduper(x, n = 2)
#> [1] "Philadelphia_3" "Denver_4" "Den_Den_Den_Den_Den_5"
deduper(x, n = 3)
#> [1] "Philadelphia_3" "Denver_4" "Den_Den_Den_Den_5"
deduper(x, n = 4)
#> [1] "Philadelphia_3" "Denver_4" "Den_Den_Den_5"
x <- c("Philadelphia,Philadelphia_3", "Denver Denver_4")
## Shows that comma and space separators are also detected by default
deduper(x)
#> [1] "Philadelphia_3" "Denver_4"
## Works even if delimiter is inside matched string,
## or separators vary
x <- c("Den_5_Den_5_Den_5,Den_5 Den_5")
deduper(x)
#> [1] "Den_5"
## generate vector
x <- replicate(10, paste(sample(letters, 5), collapse = ""))
n <- c(paste0("_", sample(1:10, 5)), rep("", 5))
x <- paste0(x, "_", x, n, n)
x
#> [1] "psxoh_psxoh_1_1" "kqxdp_kqxdp_10_10" "dzlpw_dzlpw_9_9"
#> [4] "almze_almze_3_3" "czlha_czlha_5_5" "zkvem_zkvem"
#> [7] "lerap_lerap" "vhlny_vhlny" "uixtk_uixtk"
#> [10] "weguz_weguz"
deduper(x)
#> [1] "psxoh_1_1" "kqxdp_10_10" "dzlpw_9_9" "almze_3_3" "czlha_5_5"
#> [6] "zkvem" "lerap" "vhlny" "uixtk" "weguz"