Character String Editing and Miscellaneous Character Handling Functions

This suite of functions implements many of the features of the UNIX sed program entirely within S (function sedit). The substring.location function returns the first and last positions that a substring occupies in a larger string. The substring2<- function does the opposite of the built-in function substring: it replaces part of a string rather than extracting it. It is named substring2 because S-Plus already has a built-in function substring, which does not handle multiple replacements in a single string. replace.substring.wild edits character strings in the fashion of "change xxxxANYTHINGyyyy to aaaaANYTHINGbbbb", provided that "ANYTHING" passes an optional user-specified test function. The "yyyy" string is searched for from right to left so that constructs such as balanced parentheses are handled. numeric.string and all.digits are two examples of test functions; they check, respectively, whether each of a vector of strings is a legal numeric value or contains only the digits 0-9. When sedit is called with from="*$" or "^*" (and wild.literal=FALSE), or when replace.substring.wild is called with the same values of old or with front=TRUE or back=TRUE, the function edits the largest substring satisfying test.

substring2 is just a copy of substring so that substring2<- will work.
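
As a rough illustration of the shape a test function takes, the sketch below defines two base-R stand-ins. The names is_numeric_string and is_all_digits are hypothetical and are not the Hmisc implementations of numeric.string and all.digits; they only show the expected contract: a character vector goes in, and a logical vector of the same length comes out.

```r
# Hypothetical stand-ins for test functions (not the Hmisc source).

# TRUE where the string can be read as a number.
is_numeric_string <- function(s) {
  # as.numeric() yields NA (with a warning) for non-numeric strings,
  # so the warning is suppressed and NA is mapped to FALSE.
  !is.na(suppressWarnings(as.numeric(s)))
}

# TRUE where the string is non-empty and made only of the digits 0-9.
is_all_digits <- function(s) {
  grepl("^[0-9]+$", s)
}

is_numeric_string(c("1.5", " 1.2 ", "abc"))  # TRUE TRUE FALSE
is_all_digits(c("23", "2.5", ""))            # TRUE FALSE FALSE
```

A function of this form could be supplied as the test argument; whether the packaged functions treat edge cases such as empty strings the same way is not assumed here.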

Usage

sedit(text, from, to, test, wild.literal=FALSE)
substring.location(text, string, restrict)
# substring(text, first, last) <- setto   # S-Plus only
replace.substring.wild(text, old, new, test, front=FALSE, back=FALSE)
numeric.string(string)
all.digits(string)
substring2(text, first, last)
substring2(text, first, last) <- value

Arguments

text: a vector of character strings for sedit, substring2, substring2<- or a single character string for substring.location, replace.substring.wild.
from: a vector of character strings to translate from, for sedit. An element may contain a single asterisk wild card, which matches any sequence of characters (subject to the test function, if any) in place of the "*". An element of from may begin with "^" to force the match to begin at the beginning of text, and may end with "$" to force the match to end at the end of text.
to: a vector of character strings to translate to, for sedit. If a corresponding element in from had an "*", the element in to may also have an "*". Only single asterisks are allowed. If to is not the same length as from, the rep function is used to make it the same length.
string: a single character string, for substring.location, numeric.string, all.digits
first: a vector of integers specifying the first position to replace for substring2<-. first may also be a vector of character strings that are passed to sedit to use as patterns for replacing substrings with setto. See one of the last examples below.
last: a vector of integers specifying the ending positions of the character substrings to be replaced. The default is to go to the end of the string. When first is character, last must be omitted.
setto: a character string or vector of character strings used as replacements, in substring2<-
old: a character string to translate from for replace.substring.wild. May be "*$" or "^*" or any string containing a single "*" but not beginning with "^" or ending with "$".
new: a character string to translate to for replace.substring.wild
test: for sedit and replace.substring.wild, a function that takes a vector of character strings and returns a logical vector whose elements are TRUE or FALSE according to whether the corresponding string qualifies as the wild card string
wild.literal: set to TRUE to treat asterisks literally rather than as wild cards, and to not look for "^" or "$" in from
restrict: a vector of two integers for substring.location which specifies a range to which the search for matches should be restricted
front: specifying front = TRUE and old = "*" is the same as specifying old = "^*"
back: specifying back = TRUE and old = "*" is the same as specifying old = "*$"
value: a character vector of replacement strings, for substring2<-
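
The single-asterisk wild card described for old and new can be approximated in base R with a regular expression. The sketch below is an assumption-laden illustration, not the Hmisc algorithm: wild_replace is a hypothetical helper that ignores the test, front and back arguments and the "^" and "$" anchors, and assumes that neither old nor new contains a backslash. The literal parts of old are quoted with the PCRE \Q...\E construct, and the greedy (.*) group mimics matching the largest possible wild card string.

```r
# Hypothetical helper (not part of Hmisc): single-"*" replacement via regex.
wild_replace <- function(text, old, new) {
  # Quote a literal fragment so regex metacharacters are taken literally.
  lit <- function(s) if (nzchar(s)) paste0("\\Q", s, "\\E") else ""

  so <- as.integer(regexpr("*", old, fixed = TRUE))  # position of "*" in old
  sn <- as.integer(regexpr("*", new, fixed = TRUE))  # position of "*" in new
  stopifnot(so > 0)                                  # old must contain "*"

  # literal prefix + greedy capture group + literal suffix
  pattern <- paste0(lit(substring(old, 1, so - 1)), "(.*)",
                    lit(substring(old, so + 1)))

  # Put the captured wild card text back where "*" appears in new, if at all.
  replacement <- if (sn > 0) {
    paste0(substring(new, 1, sn - 1), "\\1", substring(new, sn + 1))
  } else {
    new
  }
  sub(pattern, replacement, text, perl = TRUE)
}

wild_replace("this is a cat", "this*cat", "that*dog")  # "that is a dog"
wild_replace("this is a cat", "is a*", "Z")            # "this Z"
```

Because the capture group is greedy, the closing literal is effectively located from the right, which corresponds to the right-to-left search for the "yyyy" string.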

Value

sedit returns a vector of character strings the same length as text. substring.location returns a list with components named first and last, each specifying a vector of character positions corresponding to matches. replace.substring.wild returns a single character string. numeric.string and all.digits return a single logical value.
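
To make the shape of the substring.location result concrete, here is a minimal base-R sketch that builds a list with first and last components. The function locate_substring is hypothetical and is not the Hmisc source. It relies on gregexpr with fixed=TRUE, which reports non-overlapping matches only, and it assumes that restrict means the whole match must lie within the given range. Returning empty vectors when nothing matches is a choice of this sketch.

```r
# Hypothetical helper (not the Hmisc source): positions of a literal substring.
locate_substring <- function(text, string, restrict = c(1L, nchar(text))) {
  hits <- gregexpr(string, text, fixed = TRUE)[[1]]
  first <- as.integer(hits[hits > 0])   # gregexpr() reports -1 when no match
  last <- first + nchar(string) - 1L    # end position of each match
  # Keep only matches lying entirely inside the restricted range.
  keep <- first >= restrict[1] & last <= restrict[2]
  list(first = first[keep], last = last[keep])
}

locate_substring("abcdefgabc", "ab")
# first: 1 8, last: 2 9
locate_substring("abcdefgabc", "ab", restrict = c(3, 999))
# first: 8, last: 9
```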

Side Effects

substring2<- modifies its first argument.
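
The side effect follows from how replacement functions work in S and R: a call of the form f(x, ...) <- value is evaluated as x <- `f<-`(x, ..., value = value), so the variable named as the first argument is reassigned. The sketch below defines a hypothetical replacement function, splice<-, to illustrate the mechanism with integer positions only. It is not the substring2<- implementation and does not support character patterns for first.

```r
# Hypothetical replacement function (not part of Hmisc).
# Replaces characters first..last of text with value; the result may be
# longer or shorter than the original string.
`splice<-` <- function(text, first, last = nchar(text), value) {
  paste0(substring(text, 1, first - 1),  # part before the replaced span
         value,                          # the replacement itself
         substring(text, last + 1))      # part after the replaced span
}

x <- "this string"
splice(x, 3, 4) <- "IS"   # same as: x <- `splice<-`(x, 3, 4, value = "IS")
x                         # "thIS string"
splice(x, 7) <- ""        # last defaults to the end of the string
x                         # "thIS s"
```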

Author

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

Examples

x <- 'this string'
substring2(x, 3, 4) <- 'IS'
x
#> [1] "thIS string"
substring2(x, 7) <- ''
x
#> [1] "thIS s"


substring.location('abcdefgabc', 'ab')
#> $first
#> [1] 1 8
#> 
#> $last
#> [1] 2 9
#> 
substring.location('abcdefgabc', 'ab', restrict=c(3,999))
#> $first
#> [1] 8
#> 
#> $last
#> [1] 9
#> 


replace.substring.wild('this is a cat','this*cat','that*dog')
#> [1] "that is a dog"
replace.substring.wild('there is a cat','is a*', 'is not a*')
#> [1] "there is not a cat"
replace.substring.wild('this is a cat','is a*', 'Z')
#> [1] "this Z"


qualify <- function(x) x==' 1.5 ' | x==' 2.5 '
replace.substring.wild('He won 1.5 million $','won*million',
                       'lost*million', test=qualify)
#> [1] "He lost 1.5 million $"
replace.substring.wild('He won 1 million $','won*million',
                       'lost*million', test=qualify)
#> [1] "He won 1 million $"
replace.substring.wild('He won 1.2 million $','won*million',
                       'lost*million', test=numeric.string)
#> [1] "He lost 1.2 million $"


x <- c('a = b','c < d','hello')
sedit(x, c('=','he*o'),c('==','he*'))
#> [1] "a == b" "c < d"  "hell"  


sedit('x23', '*$', '[*]', test=numeric.string)
#> [1] "x[23]"
sedit('23xx', '^*', 'Y_{*} ', test=all.digits)
#> [1] "Y_{23} xx"


replace.substring.wild("abcdefabcdef", "d*f", "xy")
#> [1] "abcxy"


x <- "abcd"
substring2(x, "bc") <- "BCX"
x
#> [1] "aBCXd"
substring2(x, "B*d") <- "B*D"
x
#> [1] "aBCXD"