ds-environ-he/07-string-processing-and-debugging.Rmd at main · hunter-everton/ds-environ-he · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
# String Processing & Debugging

## Lecture summary
Practical **string processing** with base R and the tidyverse (`stringr`) plus core **debugging** strategies. We cover regular expressions (regex), pattern detection and extraction, and robust text-cleaning pipelines. Then we practice reading errors and tracebacks, using interactive debuggers (`browser()`, `debugonce()`), and applying a reproducible debugging checklist. The lab applies regex to real-world text and steps through common failure modes.

### Learning objectives

By the end of this chapter you will be able to:

- Recognize and write common **regex** patterns (IDs, dates, codes) and test them.
- Use `stringr` verbs (`str_detect`, `str_extract`, `str_match`, `str_replace`, `str_split`) fluently in pipelines.
- Build tidy **text-cleaning** workflows that parse, reshape, and standardize messy columns.
- Diagnose and fix errors using **traceback**, **rlang::last_trace()**, and interactive tools.
- Apply a **debugging checklist** to isolate minimal failing examples and document fixes.

```{r setup, include=FALSE}
library(knitr)
library(stringr)
```


### String manipulation in base R

A few of the basic R functions for manipulating strings are *paste*, *strsplit*, and *substring*. *paste* and *strsplit* are basically inverses of each other: *paste* concatenates together an arbitrary set of strings (or a vector, if using the *collapse*
argument) with a user-specified separator character, while *strsplit* splits apart based on a delimiter/separator. *substring* splits apart the elements of a character vector based on fixed widths. *nchar* returns the number of characters in a string. Note that all of these operate in a vectorized fashion.

```{r, r-basics}
out <- paste("My", "name", "is", "Lauren", ".", sep = " ")
out

## split the string by spaces
strsplit(out, split = ' ')

## count the total number of characters in my string
nchar(out)
```
Note that *strsplit* returns a list because it can operate on a character vector (i.e., on multiple strings).


```{r, r-basics-2}
times <- c("04:18:04", "12:12:53", "13:47:00")
time.pieces <- strsplit(times, ":")
time.pieces

sapply(time.pieces, length)

sapply(time.pieces, function(x) x[3])

```


*substring* takes the start and end element number to extract or replace.

```{r, r-basics-3}
times <- c("04:18:04", "12:12:53", "13:47:00")
substring(times, 7, 8)

substring(times[3], 1, 2) <- '01'   ## replacement
times
```


To identify particular subsequences in strings, there are several related R functions. *grep* will look for a specified string within an R character vector and report back indices identifying the elements of the vector in which the string was found. Note that using the `fixed=TRUE` argument ensures that regular expressions are NOT used. *grepl* will return TRUEs and FALSEs if a pattern is found within a string.

*gregexpr* will indicate the position in each string that the specified string is found (use *regexpr* if you only want the first occurrence).

*gsub* can be used to replace a specified string with a replacement string (use *sub* if you only want to replace only the first occurrence).

```{r, r-pattern-1}
dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
grep("2016", dates)
grepl("2016", dates)

## start/end position of each match
gregexpr("2016", dates)
gsub("2016", "16", dates)
```

### String manipulation using *stringr*

The *stringr* package wraps the various core string manipulation functions to provide a common interface. It also removes some of the clunkiness involved in some of the string operations with the base string functions, such as having to to call *gregexpr* and then *regmatches* to pull out the matched strings. For anything but very simple takss, I'd suggest using *stringr* functions in place of R's base string functions.

Table 1 provides an overview of the key functions related to working with patterns, which are basically
wrappers for *grep*, *gsub*, *gregexpr*, etc.


|  Function                         | What it does
|-----------------------------------|---------------------------------------------------------------------
| str_detect                        |                 detects pattern, returning TRUE/FALSE
| str_count                         |                 counts matches
| str_locate/str_locate_all         |                  detects pattern, returning positions of matching characters
| str_extract/str_extract_all       |                  detects pattern, returning matches
| str_replace/str_replace_all       |                  detects pattern and replaces matches

The analog of *regexpr* vs. *gregexpr* and *sub* vs. *gsub* is that most of the functions have versions that return all the matches, not just the first match, e.g. *str_locate_all* *str_extract_all*, etc. Note that the *_all* functions return lists while the non-*_all* functions return vectors.

To specify options, you can wrap these functions around the pattern argument: `fixed(pattern, ignore_case)` and `regex(pattern, ignore_case)`. The default is *regex*, so you only need to specify that if you also want to specify additional arguments, such as *ignore_case* or others listed under `help(regex)` (invoke the help after loading *stringr*)

Let's see *stringr*'s versions of some of the base string functions mentioned in the previous sections.

```{r, r-stringr-1}
str <- c("Apple Computer", "IBM", "Apple apps")

str_locate(str, fixed("app", ignore_case = TRUE))
## Not just the first
str_locate_all(str, fixed("app", ignore_case = TRUE))

dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
## regular expression: years begin in 2010
str_locate(dates, "20[^0][0-9]")

```

The basic interface to *stringr* functions is `function(strings, pattern, [replacement])`.

### Regular expressions (regex/regexp)

Regular expressions are a domain-specific language for finding patterns and are one of the key functionalities in scripting languages such as Perl and Python, as well as the UNIX utilities *sed*, *awk* and *grep*.

The basic idea of regular expressions is that they allow us to find matches of strings or patterns in strings, as well as do substitution.

Regular expressions are good for tasks such as:
 - extracting pieces of text - for example finding all the links in an html document;
 -  cleaning and transforming text (ex. values in a column) into a uniform format;
 -  creating variables from information found in text;
 -  mining text by treating documents as data; and
 -  scraping the web for data.

Also, here's a [cheatsheet on regular expressions](https://github.com/rstudio/cheatsheets/blob/main/regex.pdf) and here is a [website where you can interactively test regular expressions on example strings](https://regex101.com).

**Versions of regular expressions**

One thing that can cause headaches is differences in version of regular expression syntax used.  As can be seen in `help(regex)`, In R, *stringr* provides *ICU regular expressions*, which are based on Perl regular expressions. More details can be found in the [regex Wikipedia page](https://en.wikipedia.org/wiki/Regular_expression).

**Commonly used regex building blocks**
Square brackets can be used to define a list or range of characters to be found. So:

- `[ABC]` matches A or B or C.
- `[A-Z]` matches any upper case letter.
- `[A-Za-z]` matches any upper or lower case letter.
- `[A-Za-z0-9]` matches any upper or lower case letter or any digit.
- `[:digit:]` matches any digit

Then there are:

- `.` matches any character.
- `\d` matches any single digit.
- `\w` matches any part of word character (equivalent to `[A-Za-z0-9]`).
- `\s` matches any space, tab, or newline.
- `\` used to escape the following character when that character is a special character. So, for example, a regular expression that found `.com` would be `\.com` because `.` is a special character that matches any character.
- `^` is an "anchor" which asserts the position at the start of the line. So what you put after the caret will only match if they are the first characters of a line.
- `$` is an "anchor" which asserts the position at the end of the line. So what you put before it will only match if they are the last characters of a line.

- `\b` asserts that the pattern must match at a word boundary.
Putting this either side of a word stops the regular expression matching longer variants of words. So:
  - the regular expression `mark` will match not only `mark` but also find `marking`, `market`, `unremarkable`, and so on.
  - the regular expression `\bword` will match `word`, `wordless`, and `wordlessly`.
  - the regular expression `comb\b` will match `comb` and `honeycomb` but not `combine`.
  - the regular expression `\brespect\b` will match `respect` but not `respectable` or `disrespectful`.

Other useful special characters are:

- `*` matches the preceding element zero or more times. For example, ab\*c matches "ac", "abc", "abbbc", etc.
- `+` matches the preceding element one or more times. For example, ab+c matches "abc", "abbbc" but not "ac".
- `?` matches when the preceding character appears zero or one time.
- `{VALUE}` matches the preceding character the number of times defined by VALUE; ranges, say, 1-6, can be specified with the syntax `{VALUE,VALUE}`, e.g. `\d{1,9}` will match any number between one and nine digits in length.
- `|` means **or**.

### General principles for working with regex

The syntax is very concise, so it's helpful to break down individual regular expressions into the component parts to understand them.  Since regex are their own language, it's a good idea to build up a regex in pieces as a way of avoiding errors just as we would with any computer code. *str_detect* in R's *stringr* is particularly useful in seeing *what* was matched to help in understanding and learning regular expression syntax and debugging your regex.

The *grep*, *gregexpr* and *gsub* functions and their *stringr* analogs are more powerful when used with regular expressions. In the following examples, we'll illustrate usage of *stringr* functions, but  with their base R analogs as comments.

### Working with patterns

First let's see the use of character sets and character classes.

```{r, detect-reg-1}
text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
          "They bought 731 bananas", "Please call 919.554.3800")
str_detect(text, "[[:digit:]]")

## grep("[[:digit:]]", text, perl = TRUE)

```

```{r, detect-reg-2}
# Match a single character present in the list below [:,\t.]
str_detect(text, "[:,\t.]")
## grep("[:,\t.]", text)

str_locate_all(text, "[:,\t.]")
## gregexpr("[:,\t.]", text)

# + matches the previous token between one to unlimited times
# [:digit:] matches a digit [0-9] (also written as \d)

str_extract_all(text, "[[:digit:]]+")
## matches <- gregexpr("[[:digit]]+", text)
## regmatches(text, matches)

str_replace_all(text, "[[:digit:]]", "Z")
## gsub("[[:digit:]]", "Z", text)

```

## In class challenge: What will the regular expression `^[Oo]rgani.e\b` match?

```{r, class-challenge-1}

```


Now let's make use of repetitions.

Let's search for US/Canadian/Caribbean phone numbers in the example text we've been using:


```{r, repetitions-1}
text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
          "They bought 731 bananas", "Please call 919.554.3800")
pattern <- "[[:digit:]]{3}[-.][[:digit:]]{3}[-.][[:digit:]]{4}"
str_extract_all(text, pattern)
## matches <- gregexpr(pattern, text)
## regmatches(text, matches)
```

## In class challenge 2: How would I extract an email address from an arbitrary text string?

```{r, class-challenge-2}
```

### Groups

- Parentheses () in a regular expression define a capturing group.

- The backreference \\1 refers to the first capturing group in the same regex.

For example, here we'll find any numbers and add underscores before and after them:

```{r, references-basic}
text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
          "They bought 731 bananas", "Please call 919.554.3800")

# Match a single character present in the list below [0-9]
# + matches the previous token between one to unlimited times
# 0-9 matches a single character in the range between 0

str_replace_all(text, "([0-9]+)", "_\\1_")

```

In class challenge 3: Suppose a text string has dates in the form “Aug-3”, “May-9”, etc. and I want them in the form “3 Aug”, “9 May”, etc. How would I do this search/replace?

### Other comments

Regular expression can be used in a variety of places. E.g., to split by any number of white space characters

```{r, split-1}
line <- "a dog\tjumped\nover \tthe moon."
cat(line)
str_split(line, "[[:space:]]+")
str_split(line, "[[:blank:]]+")

```

Using backslashes to 'escape' particular characters can be tricky. One rule of thumb is to just keep adding backslashes until you get what you want!

```{r, escaping-1}
## last case here is literally a backslash and then 'n'
strings <- c("Hello", "Hello.", "Hello\nthere", "Hello\\nthere")
cat(strings, sep = "\n")

str_detect(strings, ".")           ## . means any character
## str_detect(strings, "\.")       ## \. looks for the special symbol \.
str_detect(strings, "\\.")         ## \\ says treat \ literally, which then escapes the .
str_detect(strings, "\n")          ## \n looks for the special symbol \n
## str_detect(strings, "\\")       ## \\ says treat \ literally, but \ is not meaningful regex
str_detect(strings, "\\\\")        ## R parser removes two \ to give \\; then in regex \\ treats second \ literally

```


### Additional resources

- Wickham (2019) *R for Data Science*, Chapters on Strings & Regular Expressions.
- RStudio cheatsheets: **Stringr**, **Regex**, **Debugging**.

```{r child="readings/07-reading.Rmd", echo=FALSE, error=TRUE}
```


```{r, child="labs/07-lab-string-processing_student.Rmd"}
```