 ### Kevin Vo

Data Enthusiast, works in R and Python

# Question

I am trying to create a new vector of prices from the given text. I am only allowed to use `gsub`.

``````
test = c('Testing $26,500\ntesting',
'Testing tesing $79+\n TOTAL: $79200',
'Testing $3880. Testing',
'Testing -$69000Engine: $69000100%',
'Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5')
``````

Desired Output:

``````
[1] 26500 79200  3880 69000  7495
``````

I have tried multiple regular expressions but I can’t get the correct results.

• First attempt:
``````
> gsub(".*\\$(\\d+)[,|.](\\d+).*", "\\1\\2", test)
[1] "26500"
[2] "Testing tesing $79+\n TOTAL: $79200"
[3] "Testing $3880. Testing"
[4] "Testing -$69000Engine: $69000100%"
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
``````
• Second attempt:
``````
> gsub(".*\\$(\\d+)[,|.].*", "\\1", test)
[1] "26"
[2] "Testing tesing $79+\n TOTAL: $79200"
[3] "3880"
[4] "Testing -$69000Engine: $69000100%"
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
``````
• Third attempt:
``````
> gsub("(?:.*|.*?*)\\$([0-9]+).*", "\\1", test)
[1] "26"
[2] "79200"
[3] "3880"
[4] "69000100"
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
``````

Question: How can I fix this and avoid using multiple gsub function calls?

# Answer

I don’t believe this can be done with a single `gsub` call: the last string needs pre-processing because its digits are “disconnected” by spaces, and the first one because of its comma thousands separator.

I can only “contract” the code to 2 gsub calls:

• `gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=TRUE)` removes the commas and whitespace that precede each digit in a run of digits following a `$` symbol
• `gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=TRUE)` then extracts the required price from each string.
``````
> test <- c("Testing $26,500\ntesting", "Testing tesing $79+\n TOTAL: $79200", "Testing $3880. Testing", "Testing -$69000Engine: $69000100%", "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")
> # Step 1: drop commas and whitespace inside the digit runs that follow "$"
> test <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=TRUE)
> # Step 2: keep only the digits after "-$" if present, else after the last "$"
> test <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=TRUE)
> test
[1] "26500" "79200" "3880"  "69000" "7495"
``````
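If the two steps are needed in more than one place, they can be wrapped in a small helper. The sketch below is only an illustration: the name `extract_price` is hypothetical, and the final `as.numeric()` conversion is an assumption that a numeric vector is wanted, as in the desired output shown in the question.

```r
# Hypothetical helper (not part of the original answer): wraps the two gsub() calls.
extract_price <- function(x) {
  # Step 1: drop whitespace and commas inside each digit run that follows a "$".
  x <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", x, perl = TRUE)
  # Step 2: keep only the digits after "-$" if present, otherwise after the last "$".
  x <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", x, perl = TRUE)
  # Assumption: the caller wants numbers rather than character strings.
  as.numeric(x)
}

test <- c("Testing $26,500\ntesting",
          "Testing tesing $79+\n TOTAL: $79200",
          "Testing $3880. Testing",
          "Testing -$69000Engine: $69000100%",
          "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")

prices <- extract_price(test)
print(prices)
```

Note that a string containing no `$` followed by a digit is left unchanged by step 2, so `as.numeric()` would return `NA` (with a warning) for it.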

Since you are learning regex, here are regex breakdowns:

Regex 1:

• `([$]|(?!^)\\G)` - match and capture a “leading boundary”: either a literal `$` symbol or, via `(?!^)\G`, the position where the previous successful match ended (`\G` also matches at the very beginning of the string, and the negative look-ahead `(?!^)` rules that case out)
• `[\\s,]*` - match 0 or more commas or whitespace
• `(\\d)` - match and capture a digit

With the `\1\2` replacement pattern, we put back the `$` symbol (Group 1 is empty when the `\G` branch matched) and the digit, so only the separators between them are dropped.
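To see the effect of Regex 1 in isolation, here is a minimal sketch. The second input string is made up for illustration (it is not from the question) and shows that spaced digits which do not follow a `$` are left alone, because `\G` only matches where the previous match ended.

```r
# Illustrative input: one "$" price with a comma, plus one spaced-out "$" price
# followed by spaced digits that do NOT belong to a "$" run.
x <- c("Testing $26,500\ntesting",
       "price is $ 7 4 9 5, not 1 2 3")

step1 <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", x, perl = TRUE)

# step1[1] is "Testing $26500\ntesting"
# step1[2] is "price is $7495, not 1 2 3"
print(step1)
```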

Regex 2:

• `^` - Beginning of a string
• `(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))` - a branch-reset group `(?|...|...)`, in which the capturing group index restarts at 1 in every alternative (so a single `\1` reference in the replacement pattern addresses the `(\\d+)` of whichever alternative matched). Its alternatives are:

• `[\\s\\S]*-[$](\\d+)` - any zero or more characters (`[\s\S]*`) followed by a hyphen, then a `$`, and then 1 or more digits (`\d+`, Group 1)
• `|` - or…
• `[\\s\\S]*[$](\\d+)` - any zero or more characters (`[\s\S]*`) followed by a `$` and then 1 or more digits (`\d+`, still Group 1)

• `[\\s\\S]*$` - any characters, 0 or more occurrences (`[\s\S]*`), up to the end of the string (`$`).

Since the pattern consumes the whole string, replacing the match with just the `\1` back-reference leaves only the price.
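The role of the `-$` alternative in Regex 2 is easiest to see on the fourth sample string. The following sketch compares a single-branch pattern (written here only for comparison, it is not part of the answer) with the branch-reset version; the variable names are illustrative.

```r
s <- "Testing -$69000Engine: $69000100%"

# Single branch: the greedy [\s\S]* runs to the LAST "$", so the wrong number wins.
last_dollar <- gsub("^[\\s\\S]*[$](\\d+)[\\s\\S]*$", "\\1", s, perl = TRUE)

# Branch reset: the "-$" alternative is tried first, and both captures are Group 1.
with_reset <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$",
                   "\\1", s, perl = TRUE)

# A string without "-$" falls through to the second alternative.
fallback <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$",
                 "\\1", "Testing $3880. Testing", perl = TRUE)

# last_dollar is "69000100", with_reset is "69000", fallback is "3880"
print(c(last_dollar, with_reset, fallback))
```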