Removes redundant words from beginnings of character strings

In Qualtrics data, column names sometimes contain repeated words, such as "Philadelphia_Philadelphia_3". This function changes a vector c("Philadelphia_Philadelphia_3", "Denver_Denver_4") to c("Philadelphia_3", "Denver_4"). It is non-destructive: values that do not begin with a repeated word are returned unaltered.

deduper(x, sep = ",_\\s-", n = NULL)

Arguments

x: Character vector
sep: Delimiter. A regular expression indicating the point at which to split the strings before checking for duplicates. The default looks for repeats separated by a comma, an underscore, a single whitespace character, or a hyphen.
n: Limit on the number of duplicates to remove. The default, NULL, removes all duplicates at the beginning of a string.

Value

Cleaned-up character vector.

Details

See https://stackoverflow.com/questions/43711240/r-regular-expression-match-omit-several-repeats
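
The general idea can be illustrated with a short regular-expression sketch. This is not the package source: `dedupe_sketch` is a hypothetical stand-in written for illustration, and it assumes that `sep` is placed inside a character class and that `n` maps onto a bounded quantifier.

```r
## Hypothetical helper for illustration only; not the kutils implementation.
dedupe_sketch <- function(x, sep = ",_\\s-", n = NULL) {
    ## "+" strips every leading repeat; "{1,n}" strips at most n of them
    quant <- if (is.null(n)) "+" else paste0("{1,", n, "}")
    ## ^(.+?)        shortest leading token that is followed by a repeat
    ## (?:[sep]\\1)  one delimiter character, then the same token again
    pattern <- paste0("^(.+?)(?:[", sep, "]\\1)", quant)
    ## Keep only the first copy of the token; unmatched strings pass through
    sub(pattern, "\\1", x, perl = TRUE)
}

x <- c("Philadelphia_Philadelphia_3", "Denver_Denver_4",
       "Den_Den_Den_Den_Den_Den_Den_5")
res_all <- dedupe_sketch(x)
res_two <- dedupe_sketch(x, n = 2)
```

Under these assumptions, `res_all` and `res_two` match the first two results shown in the Examples section. The sketch does not check word boundaries, so an input such as "Den_Denver_4" would also be shortened; a more careful version would need to guard against that.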

Author

Paul Johnson <pauljohn@ku.edu>

Examples

x <- c("Philadelphia_Philadelphia_3", "Denver_Denver_4",
        "Den_Den_Den_Den_Den_Den_Den_5")
deduper(x)
#> [1] "Philadelphia_3" "Denver_4"       "Den_5"         
deduper(x, n = 2)
#> [1] "Philadelphia_3"        "Denver_4"              "Den_Den_Den_Den_Den_5"
deduper(x, n = 3)
#> [1] "Philadelphia_3"    "Denver_4"          "Den_Den_Den_Den_5"
deduper(x, n = 4)
#> [1] "Philadelphia_3" "Denver_4"       "Den_Den_Den_5" 
x <- c("Philadelphia,Philadelphia_3", "Denver Denver_4")
## Shows comma also detected by default
deduper(x)
#> [1] "Philadelphia_3" "Denver_4"      
## Works even if delimiter is inside matched string,
## or separators vary
x <- c("Den_5_Den_5_Den_5,Den_5 Den_5")
deduper(x)
#> [1] "Den_5"
## Generate a random vector of names, each with a duplicated prefix
x <- replicate(10, paste(sample(letters, 5), collapse = ""))
n <- c(paste0("_", sample(1:10, 5)), rep("", 5))
x <- paste0(x, "_", x, n, n)
x
#>  [1] "psxoh_psxoh_1_1"   "kqxdp_kqxdp_10_10" "dzlpw_dzlpw_9_9"  
#>  [4] "almze_almze_3_3"   "czlha_czlha_5_5"   "zkvem_zkvem"      
#>  [7] "lerap_lerap"       "vhlny_vhlny"       "uixtk_uixtk"      
#> [10] "weguz_weguz"      
deduper(x)
#>  [1] "psxoh_1_1"   "kqxdp_10_10" "dzlpw_9_9"   "almze_3_3"   "czlha_5_5"  
#>  [6] "zkvem"       "lerap"       "vhlny"       "uixtk"       "weguz"