nc: named capture
User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting/reshaping columns that match a regular expression. If you use this code, please read and cite my related R Journal papers:
- Comparing namedCapture with other R packages for regular expressions (2019).
- Wide-to-tall Data Reshaping Using Regular Expressions and the nc Package (2021).
fruit.vec <- c("granny smith apple", "blood orange and yellow banana")
fruit.pattern <- list(type=".*?", " ", fruit="orange|apple|banana")
nc::capture_first_vec(fruit.vec, fruit.pattern)
#> type fruit
#> 1: granny smith apple
#> 2: blood orange
nc::capture_all_str(fruit.vec, fruit.pattern)
#> type fruit
#> 1: granny smith apple
#> 2: blood orange
#> 3: and yellow banana

(one.iris <- iris[1,])
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
nc::capture_melt_single(one.iris, part=".*", "[.]", dim=".*")
#> Species part dim value
#> 1: setosa Sepal Length 5.1
#> 2: setosa Sepal Width 3.5
#> 3: setosa Petal Length 1.4
#> 4: setosa Petal Width 0.2
nc::capture_melt_multiple(one.iris, part=".*", "[.]", column=".*")
#> Species part Length Width
#> 1: setosa Petal 1.4 0.2
#> 2: setosa Sepal 5.1 3.5
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim=".*")
#> Species dim Petal Sepal
#> 1: setosa Length 1.4 5.1
#> 2: setosa Width 0.2 3.5

install.packages("nc")
## or:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/nc")

Watch the screencast tutorial videos!

The main functions provided in nc are:
| Subject | nc function | Similar to | And |
|---|---|---|---|
| Single string | capture_all_str | stringr::str_match_all | rex::re_matches |
| Character vector | capture_first_vec | stringr::str_match | rex::re_matches |
| Data frame chr cols | capture_first_df | tidyr::extract/separate_wider_regex | data.table::tstrsplit |
| Data frame col names | capture_melt_single | tidyr::pivot_longer | data.table::melt |
| Data frame col names | capture_melt_multiple | tidyr::pivot_longer | data.table::melt |
| File paths | capture_first_glob | arrow::open_dataset | |
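
To illustrate the "Data frame chr cols" row of the table, here is a minimal sketch of `nc::capture_first_df`. The input data, the `int.pattern` helper, and the output column names are made up for illustration; the sketch assumes that each named argument gives an input column name together with the pattern to match against that column, so check `?nc::capture_first_df` before relying on it.

```r
# Hypothetical example data: two character columns to parse.
subject.df <- data.frame(
  JobID = c("13937810_25", "14022192_1"),
  position = c("chr10:213054000-213055000", "chrM:111000-222000"),
  stringsAsFactors = FALSE)

# Reusable sub-pattern (illustrative name): digits, converted to integer.
int.pattern <- list("[0-9]+", as.integer)

# Each named argument is an input column name; its value is the pattern
# to match against that column. Named list elements become output columns.
job.dt <- nc::capture_first_df(
  subject.df,
  JobID = list(job = int.pattern, "_", task = int.pattern),
  position = list(
    chrom = "chr.*?", ":",
    chromStart = int.pattern, "-",
    chromEnd = int.pattern))
print(job.dt)
```

The result should keep the two input columns and add one column per named group, with the numeric groups stored as integers.
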
- Vignette 0 provides an overview of the various functions.
- Vignette 1 discusses `capture_first_vec` and `capture_first_df`, which capture the first match in each of several subjects (character vector, data frame character columns).
- Vignette 2 discusses `capture_all_str`, which captures all matches in a single string, or a single multi-line text file. The vignette also shows how to use `capture_all_str` on several different strings/files, using data.table `by` syntax.
- Vignette 3 discusses `capture_melt_single` and `capture_melt_multiple`, which match a regex to the column names of a wide data frame, then melt/reshape the matching columns. These functions are especially useful when more than one separate piece of information can be captured from each column name, e.g. the iris column names `Petal.Width`, `Sepal.Width`, etc. each have two pieces of information (flower part and measurement dimension).
- Vignette 4 shows comparisons with related R packages.
- Vignette 5 explains how to use helper functions for creating complex regular expressions.
- Vignette 6 explains how to use different regex engines.
- Vignette 7 explains how to read regularly named files, and use a regex to extract meta-data from the file names, using `nc::capture_first_glob`.

By default, nc uses PCRE. Other options include ICU and RE2.
To tell nc to use a particular engine, set the nc.engine option:

options(nc.engine="RE2")

Every function also has an engine argument, e.g.

nc::capture_first_vec(
"foo a\U0001F60E# bar",
before=".*?",
emoji="\\p{EMOJI_Presentation}",
after=".*",
engine="ICU")
#> before emoji after
#> 1 foo a 😎 # bar

For a detailed comparison of regex C libraries in R (ICU, PCRE, TRE, RE2), see my R Journal (2019) paper about namedCapture.

The nc reshaping functions provide functionality similar to packages
tidyr, stats, data.table, reshape, reshape2, cdata, utils, etc. The
main difference is that nc::capture_melt_* support named capture
regular expressions with type conversion, which (1) makes it easier to
create/maintain a complex regex, and (2) results in less repetition in
user code. For a detailed comparison, see my R Journal (2021) paper
about nc.
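
As a small sketch of what type conversion looks like in a reshape, the example below melts two columns whose names contain a day number and stores that number as an integer. The `wide.df` table and its column names are hypothetical; the sketch assumes that a conversion function placed directly after a named pattern is applied to that group, as in the other nc capture functions.

```r
# Hypothetical wide table: one id column and two measurement columns.
wide.df <- data.frame(id = "subject1", day1 = 5.1, day2 = 6.2)

# Match column names of the form day<digits>. The un-named "day" is a
# literal prefix; the named group day captures the digits, and
# as.integer converts the captured text to an integer column.
tall.dt <- nc::capture_melt_single(
  wide.df,
  "day",
  day = "[0-9]+", as.integer)
print(tall.dt)
```

Because the conversion is declared next to the pattern, no separate post-processing step is needed to turn the captured text into numbers.
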
Below I list the main
differences between the functions in nc and other analogous R functions:
- Main `nc` functions all have the `capture_` prefix for easy auto-completion.
- Output in `nc` is always a data.table (other packages output either a list, character matrix, or data frame).
- For memory efficiency, `nc::capture_first_df` modifies the input if it is a data table, whereas `tidyr` functions always copy the input table.
- By default, `nc::capture_first_vec` stops with an error if any subjects do not match, whereas other functions return NA/missing rows.
- `nc::capture_all_str` only supports capturing multiple matches in a single subject (returning a data table), whereas other functions support multiple subjects (and return a list of character matrices). For handling multiple subjects using `nc`, use `DT[, nc::capture_all_str(subject), by]` (see vignette 2 for more info).
- `nc::capture_melt_single` and `nc::capture_melt_multiple` use regex for wide-to-tall data reshaping; see Vignette 3 and my R Journal (2021) paper for more info. Whereas in nc these are two separate functions, other packages typically provide a single function which does both kinds of reshaping, for example `measure` in `data.table`.
- `nc::capture_first_glob` is for reading any kind of regularly named files into R using regex, whereas `arrow::open_dataset` requires a particular naming scheme (does not support regex).
- Helper function `nc::measure` can be used to create the `measure.vars` argument of `data.table::melt`, and `nc::capture_longer_spec` can be used to create the `spec` argument of `tidyr::pivot_longer`. This can be useful if you want to use nc to define the regex, but you want to use the other package functions to do the reshape.
- Similar to `rex::capture`, helper function `nc::field` is provided for defining patterns that match subjects like variable=value, and create a column/group named variable (useful to avoid repeating variable names in regex code). See vignette 2 for more info.
- Similar to `rex::or`, `nc::alternatives_with_shared_groups` is provided for defining a pattern containing alternatives with shared groups. See vignette 5 for more info.
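
Two of the differences listed above can be sketched in a few lines. The data and the names `subject.dt`, `pattern`, `match.dt` and `first.dt` are hypothetical. The second call uses the `nomatch.error` argument, which is not described in this README; it is assumed here to be the switch that turns the default error into NA rows, so confirm it in `?nc::capture_first_vec`.

```r
library(data.table)

# Hypothetical subjects: one string per row, each with one or more matches.
subject.dt <- data.table(
  id = c("a", "b"),
  text = c("x=1 y=2", "x=3"))
pattern <- list(variable = "[a-z]", "=", value = "[0-9]+", as.integer)

# Multiple subjects with capture_all_str: run it once per group using by.
# The result has one row per match, with id identifying the subject.
match.dt <- subject.dt[, nc::capture_all_str(text, pattern), by = id]
print(match.dt)

# Assumed argument: nomatch.error=FALSE gives a row of NA for a subject
# that does not match, instead of stopping with an error.
first.dt <- nc::capture_first_vec(
  c("x=1", "no match here"), pattern, nomatch.error = FALSE)
print(first.dt)
```

The `by` idiom keeps the one-table-out convention of nc while still handling several subjects, at the cost of requiring the subjects to live in a data.table.
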
