This function locates text boundaries (like character, word, line, or sentence boundaries) and splits strings at the indicated positions.
Usage
stri_split_boundaries(
str,
n = -1L,
tokens_only = FALSE,
simplify = FALSE,
...,
opts_brkiter = NULL
)
Arguments
- str
character vector or an object coercible to
- n
integer vector, maximal number of strings to return
- tokens_only
single logical value; may affect the result if n is positive, see Details
- simplify
single logical value; if TRUE or NA, then a character matrix is returned; otherwise (the default), a list of character vectors is given, see Value
- ...
additional settings for opts_brkiter
- opts_brkiter
a named list with ICU BreakIterator's settings, see stri_opts_brkiter; NULL for the default break iterator, i.e., line_break
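As a minimal sketch (assuming the stringi package is attached), break iterator settings such as type can be passed either directly through ... or bundled in a list created by stri_opts_brkiter; the two calls below are expected to give the same result. The input x is a made-up string used only for illustration.

```r
library(stringi)

# Hypothetical sample input, for illustration only
x <- "Spam, eggs, and bacon."

# Settings passed directly via `...`
via_dots <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)

# The same settings bundled with stri_opts_brkiter()
via_opts <- stri_split_boundaries(
  x,
  opts_brkiter = stri_opts_brkiter(type = "word", skip_word_none = TRUE)
)

# Compare the two results
identical(via_dots, via_opts)
```

Passing settings through ... is shorter for one-off calls, while a list built once with stri_opts_brkiter may be more convenient when the same settings are reused across several boundary-related functions.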
Value
If simplify=FALSE (the default),
then the function returns a list of character vectors.
Otherwise, stri_list2matrix with byrow=TRUE
and n_min=n arguments is called on the resulting object.
In such a case, a character matrix with length(str) rows
is returned. Note that stri_list2matrix's fill
argument is set to an empty string and NA,
for simplify equal to TRUE and NA, respectively.
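The effect of simplify can be sketched as follows, assuming the stringi package is attached. The input x is a made-up vector whose elements contain two and three words, so that the shorter row has to be padded.

```r
library(stringi)

# Hypothetical input: the first string has 2 words, the second has 3
x <- c("a b", "c d e")

# Default: a list with one character vector per element of x
as_list <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)

# simplify = TRUE: character matrix, missing cells filled with ""
m_empty <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE,
                                 simplify = TRUE)

# simplify = NA: character matrix, missing cells filled with NA
m_na <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE,
                              simplify = NA)

dim(m_empty)    # one row per element of x
m_empty[1, 3]   # padding cell in the shorter row (empty string)
m_na[1, 3]      # padding cell in the shorter row (NA)
```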
Details
Vectorized over str and n.
If n is negative (the default), then all text pieces are extracted.
Otherwise, if tokens_only is FALSE (which is the default),
then n-1 tokens are extracted (if possible) and the n-th string
gives the (non-split) remainder (see Examples).
On the other hand, if tokens_only is TRUE,
then only full tokens (up to n pieces) are extracted.
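The interplay of n and tokens_only described above can be tried out with a small sketch like the one below, assuming the stringi package is attached. The input x is a made-up string; the exact content of the remainder piece is not asserted here, only the number of pieces returned.

```r
library(stringi)

# Hypothetical input with four words
x <- "one two three four"

# n negative (the default): all pieces are extracted
all_pieces <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)

# n = 2, tokens_only = FALSE: one token, then the non-split remainder
with_rest <- stri_split_boundaries(x, n = 2, type = "word",
                                   skip_word_none = TRUE)

# n = 2, tokens_only = TRUE: only full tokens, up to 2 pieces
tokens <- stri_split_boundaries(x, n = 2, tokens_only = TRUE, type = "word",
                                skip_word_none = TRUE)

# Number of pieces returned in each case
lengths(list(all_pieces[[1]], with_rest[[1]], tokens[[1]]))
```

Printing with_rest and tokens side by side shows whether the last piece is the unsplit tail of the string or a proper token.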
For more information on text boundary analysis
performed by ICU's BreakIterator, see
stringi-search-boundaries.
See also
The official online manual of stringi at https://stringi.gagolewski.com/
Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02
Other search_split:
about_search,
stri_split(),
stri_split_lines()
Other locale_sensitive:
%s<%(),
about_locale,
about_search_boundaries,
about_search_coll,
stri_compare(),
stri_count_boundaries(),
stri_duplicated(),
stri_enc_detect2(),
stri_extract_all_boundaries(),
stri_locate_all_boundaries(),
stri_opts_collator(),
stri_order(),
stri_rank(),
stri_sort(),
stri_sort_key(),
stri_trans_tolower(),
stri_unique(),
stri_wrap()
Other text_boundaries:
about_search,
about_search_boundaries,
stri_count_boundaries(),
stri_extract_all_boundaries(),
stri_locate_all_boundaries(),
stri_opts_brkiter(),
stri_split_lines(),
stri_trans_tolower(),
stri_wrap()
Author
Marek Gagolewski and other contributors
Examples
test <- 'The\u00a0above-mentioned features are very useful. ' %s+%
'Spam, spam, eggs, bacon, and spam. 123 456 789'
stri_split_boundaries(test, type='line')
#> [[1]]
#> [1] "The above-" "mentioned " "features " "are "
#> [5] "very " "useful. " "Spam, " "spam, "
#> [9] "eggs, " "bacon, " "and " "spam. "
#> [13] "123 " "456 " "789"
#>
stri_split_boundaries(test, type='word')
#> [[1]]
#> [1] "The" " " "above" "-" "mentioned" " "
#> [7] "features" " " "are" " " "very" " "
#> [13] "useful" "." " " "Spam" "," " "
#> [19] "spam" "," " " "eggs" "," " "
#> [25] "bacon" "," " " "and" " " "spam"
#> [31] "." " " "123" " " "456" " "
#> [37] "789"
#>
stri_split_boundaries(test, type='word', skip_word_none=TRUE)
#> [[1]]
#> [1] "The" "above" "mentioned" "features" "are" "very"
#> [7] "useful" "Spam" "spam" "eggs" "bacon" "and"
#> [13] "spam" "123" "456" "789"
#>
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_letter=TRUE)
#> [[1]]
#> [1] "123" "456" "789"
#>
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_number=TRUE)
#> [[1]]
#> [1] "The" "above" "mentioned" "features" "are" "very"
#> [7] "useful" "Spam" "spam" "eggs" "bacon" "and"
#> [13] "spam"
#>
stri_split_boundaries(test, type='sentence')
#> [[1]]
#> [1] "The above-mentioned features are very useful. "
#> [2] "Spam, spam, eggs, bacon, and spam. "
#> [3] "123 456 789"
#>
stri_split_boundaries(test, type='sentence', skip_sentence_sep=TRUE)
#> [[1]]
#> [1] "The above-mentioned features are very useful. "
#> [2] "Spam, spam, eggs, bacon, and spam. "
#>
stri_split_boundaries(test, type='character')
#> [[1]]
#> [1] "T" "h" "e" " " "a" "b" "o" "v" "e" "-" "m" "e" "n" "t" "i" "o" "n" "e" "d"
#> [20] " " " " " " " " "f" "e" "a" "t" "u" "r" "e" "s" " " "a" "r" "e" " " "v" "e"
#> [39] "r" "y" " " "u" "s" "e" "f" "u" "l" "." " " "S" "p" "a" "m" "," " " "s" "p"
#> [58] "a" "m" "," " " "e" "g" "g" "s" "," " " "b" "a" "c" "o" "n" "," " " "a" "n"
#> [77] "d" " " "s" "p" "a" "m" "." " " "1" "2" "3" " " "4" "5" "6" " " "7" "8" "9"
#>
# a filtered break iterator with the new ICU:
stri_split_boundaries('Mr. Jones and Mrs. Brown are very happy.
So am I, Prof. Smith.', type='sentence', locale='en_US@ss=standard') # ICU >= 56 only
#> [[1]]
#> [1] "Mr. Jones and Mrs. Brown are very happy.\n"
#> [2] "So am I, Prof. Smith."
#>