Character String Editing and Miscellaneous Character Handling Functions
This suite of functions was written to implement many of the features
of the UNIX sed program entirely within S (function sedit).
The substring.location function returns the first and last character
positions that a substring occupies in a larger string. The substring2<-
function does the opposite of the builtin function substring: it replaces part of a string rather than extracting it.
It is named substring2 because for S-Plus there is a built-in
function substring, but it does not handle multiple replacements in
a single string.
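As an illustration only, the position-reporting idea behind substring.location can be approximated in base R with gregexpr. The helper name locate_fixed_sketch is hypothetical and is not part of this suite. It assumes a single text string, a literal (non-regex) search string, and non-overlapping matches, and it returns empty vectors when nothing matches, which may differ from what substring.location does.

```r
# Hypothetical helper, for illustration; not the package's implementation.
locate_fixed_sketch <- function(text, string) {
  # fixed = TRUE treats `string` literally instead of as a regular expression
  m <- gregexpr(string, text, fixed = TRUE)[[1]]
  # gregexpr signals "no match" with -1 (assumed handling: return empty vectors)
  if (m[1] == -1) return(list(first = integer(0), last = integer(0)))
  first <- as.integer(m)                          # start position of each match
  last  <- first + attr(m, "match.length") - 1L   # end position of each match
  list(first = first, last = last)
}

locate_fixed_sketch("abcdefgabc", "ab")
```

For the string above, the sketch reports start positions 1 and 8 and end positions 2 and 9.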
replace.substring.wild edits character strings in the fashion of
"change xxxxANYTHINGyyyy to aaaaANYTHINGbbbb", if the "ANYTHING"
passes an optional user-specified test function. Here, the
"yyyy" string is searched for from right to left to handle
balancing parentheses, etc. numeric.string
and all.digits are two examples of test functions; they check,
respectively, whether each of a vector of strings is a legal numeric value or contains only
the digits 0-9. For sedit (if wild.literal=FALSE) with old="*$" or "^*", and for
replace.substring.wild with the same values of old or with
front=TRUE or back=TRUE, the largest substring
satisfying test is edited.
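A test function only needs to accept a character vector and return a logical vector of the same length. The two functions below are hypothetical base R stand-ins written for illustration, not the definitions of all.digits and numeric.string. They assume that an empty string should fail the digits test and that anything as.numeric can parse counts as numeric.

```r
# Hypothetical stand-ins, for illustration; not the package's own definitions.
is_all_digits_sketch <- function(string) {
  # TRUE where the element is non-empty and consists only of the digits 0-9
  grepl("^[0-9]+$", string)
}

is_numeric_string_sketch <- function(string) {
  # TRUE where as.numeric() can parse the element; conversion warnings are suppressed
  !is.na(suppressWarnings(as.numeric(string)))
}

is_all_digits_sketch(c("23", "2.3", "x23"))
is_numeric_string_sketch(c("23", "2.3", "x23"))
```

Either sketch could in principle be supplied as the test argument, since each returns one logical value per input string.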
substring2 is just a copy of substring so that
substring2<- will work.
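For readers unfamiliar with replacement functions, the following sketch shows how an assignment form such as substring2(x, 3, 4) <- 'IS' can be defined in R: a function whose name ends in <- and whose final argument is named value. The name mid_sketch<- is hypothetical. It handles only integer positions on a single string and is not the implementation used by this suite.

```r
# Hypothetical replacement function, for illustration only.
"mid_sketch<-" <- function(text, first, last = nchar(text), value) {
  # Keep everything before `first`, insert `value`, keep everything after `last`
  paste0(substring(text, 1, first - 1), value, substring(text, last + 1))
}

x <- "this string"
# R evaluates the next line as: x <- `mid_sketch<-`(x, 3, 4, value = "IS")
mid_sketch(x, 3, 4) <- "IS"
x
```

Because the replacement need not have the same length as the span it replaces, the result can be longer or shorter than the original string.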
Usage
sedit(text, from, to, test, wild.literal=FALSE)
substring.location(text, string, restrict)
# substring(text, first, last) <- setto # S-Plus only
replace.substring.wild(text, old, new, test, front=FALSE, back=FALSE)
numeric.string(string)
all.digits(string)
substring2(text, first, last)
substring2(text, first, last) <- value
Arguments
- text
a vector of character strings for sedit, substring2, substring2<- or a single character string for substring.location, replace.substring.wild.
- from
a vector of character strings to translate from, for sedit. A single asterisk wild card, meaning allow any sequence of characters (subject to the test function, if any) in place of the "*". An element of from may begin with "^" to force the match to begin at the beginning of text, and an element of from can end with "$" to force the match to end at the end of text.
- to
a vector of character strings to translate to, for sedit. If a corresponding element in from had an "*", the element in to may also have an "*". Only single asterisks are allowed. If to is not the same length as from, the rep function is used to make it the same length.
- string
a single character string, for substring.location, numeric.string, all.digits
- first
a vector of integers specifying the first position to replace for substring2<-. first may also be a vector of character strings that are passed to sedit to use as patterns for replacing substrings with setto. See one of the last examples below.
- last
a vector of integers specifying the ending positions of the character substrings to be replaced. The default is to go to the end of the string. When first is character, last must be omitted.
- setto
a character string or vector of character strings used as replacements, in substring2<-
- old
a character string to translate from for replace.substring.wild. May be "*$" or "^*" or any string containing a single "*" but not beginning with "^" or ending with "$".
- new
a character string to translate to for replace.substring.wild
- test
a function of a vector of character strings returning a logical vector whose elements are TRUE or FALSE according to whether that string element qualifies as the wild card string for sedit, replace.substring.wild
- wild.literal
set to TRUE to not treat asterisks as wild cards and to not look for "^" or "$" in old
- restrict
a vector of two integers for substring.location which specifies a range to which the search for matches should be restricted
- front
specifying front = TRUE and old = "*" is the same as specifying old = "^*"
- back
specifying back = TRUE and old = "*" is the same as specifying old = "*$"
- value
a character vector
Value
sedit returns a vector of character strings the same length as text.
substring.location returns a list with components named first
and last, each specifying a vector of character positions corresponding
to matches. replace.substring.wild returns a single character string.
numeric.string and all.digits return a single logical value.
Author
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
Examples
x <- 'this string'
substring2(x, 3, 4) <- 'IS'
x
#> [1] "thIS string"
substring2(x, 7) <- ''
x
#> [1] "thIS s"
substring.location('abcdefgabc', 'ab')
#> $first
#> [1] 1 8
#>
#> $last
#> [1] 2 9
#>
substring.location('abcdefgabc', 'ab', restrict=c(3,999))
#> $first
#> [1] 8
#>
#> $last
#> [1] 9
#>
replace.substring.wild('this is a cat','this*cat','that*dog')
#> [1] "that is a dog"
replace.substring.wild('there is a cat','is a*', 'is not a*')
#> [1] "there is not a cat"
replace.substring.wild('this is a cat','is a*', 'Z')
#> [1] "this Z"
qualify <- function(x) x==' 1.5 ' | x==' 2.5 '
replace.substring.wild('He won 1.5 million $','won*million',
                       'lost*million', test=qualify)
#> [1] "He lost 1.5 million $"
replace.substring.wild('He won 1 million $','won*million',
                       'lost*million', test=qualify)
#> [1] "He won 1 million $"
replace.substring.wild('He won 1.2 million $','won*million',
                       'lost*million', test=numeric.string)
#> [1] "He lost 1.2 million $"
x <- c('a = b','c < d','hello')
sedit(x, c('=','he*o'),c('==','he*'))
#> [1] "a == b" "c < d" "hell"
sedit('x23', '*$', '[*]', test=numeric.string)
#> [1] "x[23]"
sedit('23xx', '^*', 'Y_{*} ', test=all.digits)
#> [1] "Y_{23} xx"
replace.substring.wild("abcdefabcdef", "d*f", "xy")
#> [1] "abcxy"
x <- "abcd"
substring2(x, "bc") <- "BCX"
x
#> [1] "aBCXd"
substring2(x, "B*d") <- "B*D"
x
#> [1] "aBCXD"