A convenience function to tune the ICU Collator's behavior,
e.g., in stri_compare, stri_order,
stri_unique, stri_duplicated,
as well as stri_detect_coll
and other stringi-search-coll functions.
Usage
stri_opts_collator(
locale = NULL,
strength = 3L,
alternate_shifted = FALSE,
french = FALSE,
uppercase_first = NA,
case_level = FALSE,
normalization = FALSE,
normalisation = normalization,
numeric = FALSE
)
stri_coll(
locale = NULL,
strength = 3L,
alternate_shifted = FALSE,
french = FALSE,
uppercase_first = NA,
case_level = FALSE,
normalization = FALSE,
normalisation = normalization,
numeric = FALSE
)
Arguments
- locale
single string, NULL or '' for the default locale
- strength
single integer in {1,2,3,4}, which defines collation strength; 1 for the most permissive collation rules, 4 for the strictest ones
- alternate_shifted
single logical value; FALSE treats all the code points with non-ignorable primary weights in the same way, TRUE causes code points with primary weights that are equal to or below the variable top value to be ignored on the primary level and moved to the quaternary level
- french
single logical value; used in Canadian French; TRUE results in secondary weights being considered backwards
- uppercase_first
single logical value; NA orders upper and lower case letters in accordance with their tertiary weights, TRUE forces upper case letters to sort before lower case letters, FALSE does the opposite
- case_level
single logical value; controls whether an extra case level (positioned before the third level) is generated or not
- normalization
single logical value; if TRUE, then an incremental check is performed to see whether the input data is in the FCD form. If the data is not in the FCD form, incremental NFD normalization is performed
- normalisation
alias of normalization
- numeric
single logical value; when turned on, this attribute generates a collation key for the numeric value of substrings of digits; this is a way to get '100' to sort AFTER '2'; note that negative or non-integer numbers will not be ordered properly
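The effect of strength and uppercase_first can be illustrated with a minimal sketch. It assumes that stringi is installed and fixes the locale to en_US so that results do not depend on the session default; the variable names are illustrative only.

```r
library(stringi)

x <- "resume"
y <- "Resume"
z <- "r\u00e9sum\u00e9"  # "resume" with acute accents on both e's

# strength = 3 (the default): case differences matter
res_default <- stri_cmp_equiv(x, y, locale = "en_US")

# strength = 2: case is ignored, accents still matter
res_secondary <- stri_cmp_equiv(x, y, strength = 2L, locale = "en_US")
res_accent2   <- stri_cmp_equiv(x, z, strength = 2L, locale = "en_US")

# strength = 1: both case and accents are ignored
res_primary <- stri_cmp_equiv(x, z, strength = 1L, locale = "en_US")

# uppercase_first = TRUE puts "A" before "a" within each letter
upper_first <- stri_sort(c("a", "A", "b", "B"),
                         uppercase_first = TRUE, locale = "en_US")
```

Here the options are passed through the ... argument of stri_cmp_equiv and stri_sort; passing opts_collator=stri_opts_collator(...) explicitly should be equivalent.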
Details
ICU's collator performs a locale-aware string comparison that resembles the ordering rules of natural languages. This is a more reliable way of establishing relationships between strings than the one provided by base R, and it is both more complex and more appropriate than an ordinary bytewise comparison.
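As a rough illustration of the difference between locale-aware and bytewise ordering, the sketch below sorts two words under Slovak rules, where "ch" is treated as a single letter placed after "h". It assumes that stringi is installed and that its ICU data includes the sk_SK locale; base R's sort with method = "radix" stands in for a plain bytewise comparison.

```r
library(stringi)

words <- c("hladny", "chladny")

# Slovak collation: the digraph "ch" sorts after "h"
sk_order <- stri_sort(words, locale = "sk_SK")

# Polish collation: "c" and "h" are ordinary separate letters
pl_order <- stri_sort(words, locale = "pl_PL")

# Base R radix sort compares in the C locale (byte order)
bytewise <- sort(words, method = "radix")

print(sk_order)
print(pl_order)
print(bytewise)
```

Only the Slovak ordering places "hladny" first; the Polish and bytewise results agree for this particular input, although they would generally differ for text with accented letters or mixed case.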
References
Collation – ICU User Guide, https://unicode-org.github.io/icu/userguide/collation/
ICU Collation Service Architecture – ICU User Guide, https://unicode-org.github.io/icu/userguide/collation/architecture.html
icu::Collator Class Reference – ICU4C API Documentation,
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html
See also
The official online manual of stringi at https://stringi.gagolewski.com/
Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02
Other locale_sensitive:
%s<%(),
about_locale,
about_search_boundaries,
about_search_coll,
stri_compare(),
stri_count_boundaries(),
stri_duplicated(),
stri_enc_detect2(),
stri_extract_all_boundaries(),
stri_locate_all_boundaries(),
stri_order(),
stri_rank(),
stri_sort(),
stri_sort_key(),
stri_split_boundaries(),
stri_trans_tolower(),
stri_unique(),
stri_wrap()
Other search_coll:
about_search,
about_search_coll
Author
Marek Gagolewski and other contributors
Examples
stri_cmp('number100', 'number2')
#> [1] -1
stri_cmp('number100', 'number2', opts_collator=stri_opts_collator(numeric=TRUE))
#> [1] 1
stri_cmp('number100', 'number2', numeric=TRUE) # equivalent
#> [1] 1
stri_cmp('above mentioned', 'above-mentioned')
#> [1] -1
stri_cmp('above mentioned', 'above-mentioned', alternate_shifted=TRUE)
#> [1] 0