Kevin Vo

Data Enthusiast, works in R and Python

Question

I am trying to create a new vector of prices from the given text. I am only allowed to use gsub.

test = c('Testing $26,500\ntesting', 
         'Testing tesing $79+\n TOTAL: $79200', 
         'Testing $3880. Testing', 
         'Testing -$69000Engine: $69000100%',
         'Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5')

Desired Output:

> [1] 26500 79200  3880 69000  7495

I have tried multiple regular expressions but I can’t get the correct results.

  • First attempt:
> gsub(".*\\$(\\d+)[,|.](\\d+).*", "\\1\\2", test)
[1] "26500"
[2] "Testing tesing $79+\n TOTAL: $79200"                                 
[3] "Testing $3880. Testing"                                              
[4] "Testing -$69000Engine: $69000100%"                                   
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
  • Second attempt:
> gsub(".*\\$(\\d+)[,|.].*", "\\1", test) 
[1] "26"                                                                  
[2] "Testing tesing $79+\n TOTAL: $79200"                                 
[3] "3880"                                                                
[4] "Testing -$69000Engine: $69000100%"                                   
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
  • Third attempt:
> gsub("(?:.*|.*?*)\\$([0-9]+).*", "\\1", test) 
[1] "26"                                                                  
[2] "79200"                                                               
[3] "3880"                                                                
[4] "69000100"                                                            
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"

Question: How can I fix this and avoid using multiple gsub function calls?

Answer

I don’t believe this can be done with a single call to gsub: the last string needs pre-processing because its digits are “disconnected” by spaces, and the first one contains a comma used as a thousands separator.

I can only “contract” the code to 2 gsub calls:

  • gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=TRUE) removes commas and whitespace between the digits that follow a $ symbol
  • gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=TRUE) then extracts the required price from each string.
> test <- c("Testing $26,500\ntesting", "Testing tesing $79+\n TOTAL: $79200", "Testing $3880. Testing", "Testing -$69000Engine: $69000100%", "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")
> test <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=TRUE)
> test <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=TRUE)
> test
[1] "26500" "79200" "3880"  "69000" "7495" 
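
The result above is a character vector, whereas the desired output in the question is numeric. A minimal sketch of the final conversion, assuming the extracted strings contain only digits (the values are typed in by hand so the snippet runs on its own, and the names prices_chr and prices are only illustrative):

```r
# Character result of the two gsub calls, entered by hand for this sketch
prices_chr <- c("26500", "79200", "3880", "69000", "7495")

# as.numeric() converts the digit-only strings to numbers
prices <- as.numeric(prices_chr)
print(prices)
# [1] 26500 79200  3880 69000  7495
```

Note that as.numeric is not a gsub call, so the regex part still needs only the two passes.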

Since you are learning regex, here are regex breakdowns:

Regex 1:

  • ([$]|(?!^)\\G) - match and capture a “leading boundary”: either a $ symbol, or the position right after the previous successful match (\G). Since \G also matches at the beginning of a string, that case is excluded with the negative look-ahead (?!^)
  • [\\s,]* - match 0 or more commas or whitespace
  • (\\d) - match and capture a digit

With the \1\2 replacement pattern, we put back the $ symbol (Group 1 is empty when the match started at \G) and the digit, so only the commas and whitespace between them are dropped.
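
To see what this first pass does on its own, here is a small sketch that applies Regex 1 to the original test vector and prints the intermediate result. The variable name step1 is only illustrative.

```r
test <- c("Testing $26,500\ntesting",
          "Testing tesing $79+\n TOTAL: $79200",
          "Testing $3880. Testing",
          "Testing -$69000Engine: $69000100%",
          "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")

# Pass 1 only: drop commas/whitespace between the digits that follow a $
step1 <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl = TRUE)

# Inspect the strings that contained separated digits
print(step1[1])
# [1] "Testing $26500\ntesting"
print(step1[5])
# [1] "Testing testing original price : $8295 . Real price is $7495"
```

Only the first and last strings change here; the other three already have their digits attached directly to the $ symbol.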

Regex 2:

  • ^ - Beginning of a string
  • (?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+)) - a branch-reset group (?|...|...), in which capturing group numbering restarts at 1 in each alternative (so a single \1 reference in the replacement pattern addresses the (\\d+) of whichever alternative matched), matching:

    • [\\s\\S]*-[$](\\d+) - any zero or more characters ([\s\S]*) followed by a hyphen, then a $, and then 1 or more digits (\d+, Group 1). This alternative is tried first, so a price preceded by -$ wins when there is one
    • | - or…
    • [\\s\\S]*[$](\\d+) - any zero or more characters ([\s\S]*) followed by a $ and then 1 or more digits (\d+, still Group 1). Since [\s\S]* is greedy, this picks the last $ in the string

  • [\\s\\S]*$ - any characters, 0 or more occurrences ([\s\S]*), up to the end of the string ($).

And we replace the whole match with just the \1 back-reference to get our results.
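
The effect of putting the -[$] alternative first can be checked on the fourth test string. The sketch below compares Regex 2 with a simplified pattern that keeps only the second alternative; the variable names are only illustrative.

```r
s <- "Testing -$69000Engine: $69000100%"

# Regex 2 as given: the "-$" alternative is tried before the plain "$" one
with_branch_reset <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$",
                          "\\1", s, perl = TRUE)

# Simplified pattern with only the plain "$" alternative:
# the greedy [\s\S]* runs to the last "$" in the string
last_dollar_only <- gsub("^[\\s\\S]*[$](\\d+)[\\s\\S]*$",
                         "\\1", s, perl = TRUE)

print(with_branch_reset)
# [1] "69000"
print(last_dollar_only)
# [1] "69000100"
```

Without the first alternative the result is the same "69000100" that the third attempt in the question produced, which is why the price after -$ has to be tried first.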