Question
I am trying to create a new vector of prices from the given text. I am only allowed to use gsub.
test = c('Testing $26,500\ntesting',
'Testing tesing $79+\n TOTAL: $79200',
'Testing $3880. Testing',
'Testing -$69000Engine: $69000100%',
'Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5')
Desired Output:
> [1] 26500 79200 3880 69000 7495
I have tried multiple regular expressions but I can’t get the correct results.
- First attempt:
> gsub(".*\\$(\\d+)[,|.](\\d+).*", "\\1\\2", test)
[1] "26500"
[2] "Testing tesing $79+\n TOTAL: $79200"
[3] "Testing $3880. Testing"
[4] "Testing -$69000Engine: $69000100%"
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
- Second attempt:
> gsub(".*\\$(\\d+)[,|.].*", "\\1", test)
[1] "26"
[2] "Testing tesing $79+\n TOTAL: $79200"
[3] "3880"
[4] "Testing -$69000Engine: $69000100%"
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
- Third attempt:
> gsub("(?:.*|.*?*)\\$([0-9]+).*", "\\1", test)
[1] "26"
[2] "79200"
[3] "3880"
[4] "69000100"
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
Question: How can I fix this and avoid using multiple gsub function calls?
Answer
I don’t believe there is a way to do this with only 1 call to gsub, because two kinds of input need pre-processing first: the last string, where the digits are “disconnected” by spaces, and the first one, where a comma is used as a thousands separator.
I can only “contract” the code to 2 gsub calls:
- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=T) will remove the commas and whitespace between the $ symbol and the digits that follow it
- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=T) will actually get the required price number out of the strings.
> test <- c("Testing $26,500\ntesting","Testing tesing $79+\n TOTAL: $79200","Testing $3880. Testing", "Testing -$69000Engine: $69000100%","Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")
> test <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=T)
> test <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=T)
> test
[1] "26500" "79200" "3880" "69000" "7495"
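The desired output in the question is numeric rather than character, so the two calls above can be wrapped in a small helper that converts the result at the end. This is only a minimal sketch: the name extract_price is hypothetical and chosen for illustration, and it assumes every input string contains at least one $ followed by a digit (otherwise as.numeric would return NA with a warning).

```r
# Hypothetical helper: chains the two gsub calls from the answer,
# then converts the extracted digit strings to numbers.
extract_price <- function(x) {
  # Step 1: drop commas/whitespace between "$" and the digits that follow it
  x <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", x, perl = TRUE)
  # Step 2: keep only the digits after the last "-$" if there is one,
  # otherwise the digits after the last "$"
  x <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", x, perl = TRUE)
  as.numeric(x)
}

test <- c("Testing $26,500\ntesting",
          "Testing tesing $79+\n TOTAL: $79200",
          "Testing $3880. Testing",
          "Testing -$69000Engine: $69000100%",
          "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")

prices <- extract_price(test)
print(prices)
```

For the sample vector this yields the numeric values 26500, 79200, 3880, 69000 and 7495.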
Since you are learning regex, here are regex breakdowns:
Regex 1:
- ([$]|(?!^)\\G) - match and capture a “leading boundary” construct: either a $ symbol or the location right after the previous successful match, (?!^)\G (\G also matches at the beginning of a string, and we rule that out with the negative lookahead (?!^))
- [\\s,]* - match 0 or more commas or whitespace characters
- (\\d) - match and capture a digit
With the \1\2 replacement pattern, we restore the $ symbol and the digits after it inside the string, while the commas and whitespace matched between them are dropped.
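To see what this first pass does on its own, it can help to inspect the intermediate strings before the second gsub runs. The sketch below uses the sample vector from the question; the variable name step1 is just an illustrative choice.

```r
test <- c("Testing $26,500\ntesting",
          "Testing tesing $79+\n TOTAL: $79200",
          "Testing $3880. Testing",
          "Testing -$69000Engine: $69000100%",
          "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")

# First pass only: glue the digits after each "$" back together
step1 <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl = TRUE)

print(step1[1])  # "Testing $26500\ntesting"
print(step1[5])  # "Testing testing original price : $8295 . Real price is $7495"
```

Strings 2 to 4 come out unchanged, since their digits already follow the $ directly.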
Regex 2:
- ^ - beginning of the string
- (?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+)) - a branch-reset group (?|...|...), in which the capturing group index is reset to 1 in each alternative (so we only need the \1 reference in the replacement pattern to address the (\\d+) from either alternative), matching:
  - [\\s\\S]*-[$](\\d+) - any zero or more characters ([\s\S]*) followed by a hyphen, then a $, and then 1 or more digits (\d+, Group 1)
  - | - or
  - [\\s\\S]*[$](\\d+) - any zero or more characters ([\s\S]*) followed by a $ and then 1 or more digits (\d+, still Group 1)
- [\\s\\S]*$ - any characters, 0 or more occurrences ([\s\S]*), up to the end of the string ($)
And we replace the whole match with just the \1 back-reference to get our results.
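The hyphen alternative matters for the fourth sample string, where the wanted price follows -$ and a later $ is followed by extra digits. As a rough illustration (the variable names here are made up for the example), compare the pattern with and without that first branch:

```r
x <- "Testing -$69000Engine: $69000100%"

# Without the "-$" branch: the greedy [\s\S]* runs to the last "$",
# so the digits of "69000100" are captured
last_dollar_only <- gsub("^[\\s\\S]*[$](\\d+)[\\s\\S]*$", "\\1", x, perl = TRUE)

# With the branch-reset group: the "-$" alternative is tried first,
# so the digits right after "-$" are captured
with_hyphen_branch <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$",
                           "\\1", x, perl = TRUE)

print(last_dollar_only)    # "69000100"
print(with_hyphen_branch)  # "69000"
```

The first result is the same wrong value that the third attempt in the question produced for this string.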