R gsub & regex tutorial

Question

I am trying to create a new vector of prices from the given text. I am only allowed to use gsub.

test = c('Testing $26,500\ntesting', 
         'Testing tesing $79+\n TOTAL: $79200', 
         'Testing $3880. Testing', 
         'Testing -$69000Engine: $69000100%',
         'Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5')

Desired Output:

> [1] 26500 79200  3880 69000  7495

I have tried multiple regular expressions but I can’t get the correct results.

First attempt:

> gsub(".*\\$(\\d+)[,|.](\\d+).*", "\\1\\2", test)
[1] "26500"
[2] "Testing tesing $79+\n TOTAL: $79200"                                 
[3] "Testing $3880. Testing"                                              
[4] "Testing -$69000Engine: $69000100%"                                   
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"

Second attempt:

> gsub(".*\\$(\\d+)[,|.].*", "\\1", test) 
[1] "26"                                                                  
[2] "Testing tesing $79+\n TOTAL: $79200"                                 
[3] "3880"                                                                
[4] "Testing -$69000Engine: $69000100%"                                   
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"

Third attempt:

> gsub("(?:.*|.*?*)\\$([0-9]+).*", "\\1", test) 
[1] "26"                                                                  
[2] "79200"                                                               
[3] "3880"                                                                
[4] "69000100"                                                            
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"

Question: How can I fix this and avoid using multiple gsub function calls?

Answer

I don’t believe there is a way to use only 1 call to gsub as you need to pre-process the last price where digits are “disconnected” with spaces, and the first one with a comma decimal separator.

I can only “contract” the code to 2 gsub calls:

gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=T) will remove commas and spaces between the digits that follow $ symbol
gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=T) will actually get the required price number out of the strings.

> test <- c("Testing $26,500\ntesting","Testing tesing $79+\n TOTAL: $79200","Testing $3880. > Testing", "Testing -$69000Engine: $69000100%","Testing testing original price : $ 8 2 9 5 . Real > price is $ 7 4 9 5")
> test <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=T)
> test <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=T)
> test
[1] "26500" "79200" "3880"  "69000" "7495"

Since you are learning regex, here are regex breakdowns:

Regex 1:

([$]|(?!^)\\G) - match and capture a “leading boundary” construct matching a $ symbol and the location after each successful match with (?!^)\G (\G also matches the beginning of a string, and we eliminate it with a negative look-ahead (?!^) )
[\\s,]* - match 0 or more commas or whitespace
(\\d) - match and capture a digit

With \1\2 replacement pattern, we restore the $ symbol and the digits after it inside the string.

Regex 2:

^ - Beginning of a string
(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+)) a branch-reset group (?|...|...) where capturing group index is reset to 1 (so, we only need to use \1 reference in the replacement pattern to address both (\\d+) from each alternative) matching….
- [\\s\\S]*-[$](\\d+) - any zero or more characters ([\s\S]*) followed with a hyphen, then a $, and then 1 or more digits (\d+, Group1)
- | - or…
- [\\s\\S]*[$](\\d+) - any zero or more characters ([\s\S]*) followed with a $ and then 1 or more digits (\d+, still Group 1)

And we replace all with just \1 back-reference to get our results. - [\\s\\S]*$ - any characters, 0 or more occurrences ([\s\S]*), up to the end of the string ($).

Kevin Vo

R gsub & regex tutorial

Question

Answer

You might also enjoy (View all posts)