A convenience function to tune the ICU BreakIterator's behavior
in some text boundary analysis functions, see
stringi-search-boundaries.
Usage
stri_opts_brkiter(
type,
locale,
skip_word_none,
skip_word_number,
skip_word_letter,
skip_word_kana,
skip_word_ideo,
skip_line_soft,
skip_line_hard,
skip_sentence_term,
skip_sentence_sep
)
Arguments
- type
single string; either the break iterator type, one of
character, line_break, sentence, word, or a custom set of ICU break iteration rules; see stringi-search-boundaries
- locale
single string, NULL or '' for default locale
- skip_word_none
logical; perform no action for 'words' that do not fit into any other categories
- skip_word_number
logical; perform no action for words that appear to be numbers
- skip_word_letter
logical; perform no action for words that contain letters, excluding hiragana, katakana, or ideographic characters
- skip_word_kana
logical; perform no action for words containing kana characters
- skip_word_ideo
logical; perform no action for words containing ideographic characters
- skip_line_soft
logical; perform no action for soft line breaks, i.e., positions where a line break is acceptable but not required
- skip_line_hard
logical; perform no action for hard, or mandatory line breaks
- skip_sentence_term
logical; perform no action for sentences ending with a sentence terminator ('.', ',', '?', '!'), possibly followed by a hard separator (CR, LF, PS, etc.)
- skip_sentence_sep
logical; perform no action for sentences that do not contain an ending sentence terminator, but are ended by a hard separator or end of input
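As a minimal sketch of how these arguments fit together, the snippet below builds an options list for a word break iterator and passes it to stri_extract_all_boundaries() through its opts_brkiter argument. The sample string and the variable names are illustrative assumptions, not part of the package.

```r
library(stringi)

# Illustrative input (an assumption made for this sketch)
x <- "The quick brown fox, 42 times."

# Word break iterator that skips "words" fitting no other category,
# such as runs of spaces and punctuation marks
opts <- stri_opts_brkiter(type = "word", skip_word_none = TRUE)

# The options list is handed over via opts_brkiter
words <- stri_extract_all_boundaries(x, opts_brkiter = opts)[[1]]
print(words)
```

Options that are not set explicitly are left at their defaults, so only the settings that differ from the defaults need to be listed.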
Details
The skip_* family of settings may be used to suppress
special actions at particular types of text boundaries, e.g.,
when calling stri_locate_all_boundaries or
stri_split_boundaries.
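The effect of the skip_* settings can be sketched with the two functions named above. Here the settings are passed directly as additional arguments, which these functions forward to stri_opts_brkiter(). The input string and variable names are assumptions chosen for illustration.

```r
library(stringi)

# Illustrative input (an assumption made for this sketch)
x <- "Order 66 arrived. Really?"

# Skip uncategorised "words" (spaces, punctuation) and number-like words
parts <- stri_split_boundaries(x, type = "word",
    skip_word_none = TRUE, skip_word_number = TRUE)[[1]]
print(parts)

# The same settings applied when locating boundaries:
# one row (start, end) per text chunk that was not skipped
loc <- stri_locate_all_boundaries(x, type = "word",
    skip_word_none = TRUE, skip_word_number = TRUE)[[1]]
print(loc)

# Extracting the located ranges should give the same chunks as the split
located <- stri_sub(x, loc[, 1], loc[, 2])
print(located)
```

Setting skip_word_letter = TRUE instead of skip_word_number = TRUE would be expected to keep the number-like chunk and drop the letter words.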
Note that custom break iterator rules (advanced users only) should be specified as a single string. For a detailed description of the syntax of RBBI rules, please refer to the ICU User Guide on Boundary Analysis.
References
ubrk.h File Reference – ICU4C API Documentation,
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ubrk_8h.html
Boundary Analysis – ICU User Guide, https://unicode-org.github.io/icu/userguide/boundaryanalysis/
See also
The official online manual of stringi at https://stringi.gagolewski.com/
Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02
Other text_boundaries:
about_search,
about_search_boundaries,
stri_count_boundaries(),
stri_extract_all_boundaries(),
stri_locate_all_boundaries(),
stri_split_boundaries(),
stri_split_lines(),
stri_trans_tolower(),
stri_wrap()
Author
Marek Gagolewski and other contributors