Extract Data From First Regular Expression Match Into a Data Frame

Match a regular expression to a string, and return matches, match positions, and capture groups. This function is like its match counterpart, re_match, except that it also returns the start and end positions of each match and capture group in addition to the matched values.

re_exec(text, pattern, perl = TRUE, ...)

# S3 method for class 'rematch_records'
x$name

# S3 method for class 'rematch_allrecords'
x$name

Arguments

text: Character vector.
pattern: A regular expression. See regex for more about regular expressions.
perl: Logical. Should Perl-compatible regular expressions be used? Defaults to TRUE; setting it to FALSE disables capture groups.
...: Additional arguments to pass to gregexpr (or regexpr if text is of length zero).
x: Object returned by re_exec or re_exec_all.
name: match, start or end.
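
Why perl = FALSE disables capture groups can be illustrated with base R alone. The sketch below assumes the matching is delegated to regexpr, as the description of ... suggests; it is not the package's own code, and the variable names are illustrative. Base regexpr only attaches capture group positions to its result when perl = TRUE.

```r
# Illustrative only: compare what base regexpr() reports with and without PCRE.
pattern <- "([[:upper:]][[:lower:]]+) ([[:upper:]][[:lower:]]+)"
text <- "  Ben Franklin and Jefferson Davis"

m_pcre <- regexpr(pattern, text, perl = TRUE)
m_tre <- regexpr(pattern, text, perl = FALSE)

# With perl = TRUE the result carries capture group start positions and lengths.
has_groups_pcre <- !is.null(attr(m_pcre, "capture.start"))
# With perl = FALSE only the position of the full match is available.
has_groups_tre <- !is.null(attr(m_tre, "capture.start"))

group_starts <- as.vector(attr(m_pcre, "capture.start"))
```

Both calls locate the same full match; only the PCRE result records where each group begins.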

Value

A tidy data frame (see Section “Tidy Data”). Match record entries are length-one vectors that are set to NA if there is no match.

Tidy Data

The return value is a tidy data frame where each row corresponds to an element of the input character vector text. The values from text appear for reference in the .text character column. All other columns are list columns containing the match data. The .match column contains the match information for the full regular expression match. The remaining columns correspond to capture groups, if the pattern has any and PCRE matching is enabled with perl = TRUE (the default). If capture groups are named, the corresponding columns bear those names.

Each match data column is a list of match records, one for each element of text. A match record is a named list with entries match, start and end, which are respectively the matching (sub)string and its start and end positions (using one-based indexing).
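
The shape of a single match record can be sketched with base R. This is a hand-built illustration of the structure described above, not the package's implementation; the variable names are assumptions made for the example.

```r
# Build one match record by hand from a base regexpr() result.
text <- "\tMillard Fillmore"
m <- regexpr("[[:upper:]][[:lower:]]+", text, perl = TRUE)

start <- as.integer(m)                          # one-based start of the match
end <- start + attr(m, "match.length") - 1L     # inclusive end position

record <- list(
  match = substring(text, start, end),          # the matching substring
  start = start,
  end = end
)
```

The tab occupies position 1, so the match starts at position 2, and the end position is inclusive.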

Extracting Match Data

To make it easier to extract matching substrings or positions, a special $ operator is defined on match columns, both for the .match column and the columns corresponding to the capture groups. See examples below.
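
One way such a $ operator could work is shown in the minimal sketch below. The class name demo_records and the sample records are hypothetical, chosen only to illustrate the idea of an S3 method that pulls one entry out of every record; the package's actual methods may differ.

```r
# Hypothetical class: a list of match records with a custom `$` method.
`$.demo_records` <- function(x, name) {
  records <- unclass(x)  # drop the class so plain list access is used below
  # Collect the requested entry (match, start or end) from each record.
  sapply(records, function(r) r[[name]])
}

recs <- structure(
  list(
    list(match = "Ben", start = 3L, end = 5L),
    list(match = "Millard", start = 2L, end = 8L)
  ),
  class = "demo_records"
)

matches <- recs$match
starts <- recs$start
```

Calling unclass first avoids re-entering the custom method when the individual records are accessed.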

Examples

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
# Match first occurrence
pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 × 4
#>   first            last             .text                           .match      
#>   <rmtch_rc>       <rmtch_rc>       <chr>                           <rmtch_rc>  
#> 1 <named list [3]> <named list [3]> "  Ben Franklin and Jefferson … <named list>
#> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"            <named list>

# Custom $ to extract matches and positions
pos$first$match
#> [1] "Ben"     "Millard"
pos$first$start
#> [1] 3 2
pos$first$end
#> [1] 5 8