Question
I am trying to create a new vector of prices from the given text. I am only allowed to use gsub.
test = c('Testing $26,500\ntesting',
'Testing tesing $79+\n TOTAL: $79200',
'Testing $3880. Testing',
'Testing -$69000Engine: $69000100%',
'Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5')
Desired Output:
> [1] 26500 79200 3880 69000 7495
I have tried multiple regular expressions but I can’t get the correct results.
- First attempt:
> gsub(".*\\$(\\d+)[,|.](\\d+).*", "\\1\\2", test)
[1] "26500"
[2] "Testing tesing $79+\n TOTAL: $79200"
[3] "Testing $3880. Testing"
[4] "Testing -$69000Engine: $69000100%"
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
- Second attempt:
> gsub(".*\\$(\\d+)[,|.].*", "\\1", test)
[1] "26"
[2] "Testing tesing $79+\n TOTAL: $79200"
[3] "3880"
[4] "Testing -$69000Engine: $69000100%"
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
- Third attempt:
> gsub("(?:.*|.*?*)\\$([0-9]+).*", "\\1", test)
[1] "26"
[2] "79200"
[3] "3880"
[4] "69000100"
[5] "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5"
Question: How can I fix this and avoid using multiple gsub function calls?
Answer
I don’t believe there is a way to do this with a single gsub call: you need to pre-process the last string, where the digits are “disconnected” by spaces, and the first one, where a comma is used as a thousands separator.
I can only “contract” the code to 2 gsub calls:
gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=T)
will remove commas and spaces between the digits that follow the $ symbol, and
gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=T)
will actually get the required price number out of the strings.
> test <- c("Testing $26,500\ntesting","Testing tesing $79+\n TOTAL: $79200","Testing $3880. Testing", "Testing -$69000Engine: $69000100%","Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")
> test <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl=T)
> test <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", test, perl=T)
> test
[1] "26500" "79200" "3880" "69000" "7495"
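The desired output in the question is printed without quotes, which suggests a numeric vector, whereas gsub always returns character strings. Below is a minimal sketch that wraps the two calls in a helper and converts the result at the end. The function name extract_price is hypothetical, and the as.numeric step is an assumption about what the asker wants rather than part of the original answer.

```r
# Hypothetical helper (not from the original answer): wraps the two gsub calls.
extract_price <- function(x) {
  # Step 1: drop spaces and commas in front of each digit that follows a "$"
  x <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", x, perl = TRUE)
  # Step 2: keep the digits after "-$" if present, otherwise after the last "$"
  x <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$", "\\1", x, perl = TRUE)
  # Assumption: numbers are wanted rather than strings
  as.numeric(x)
}

test <- c("Testing $26,500\ntesting",
          "Testing tesing $79+\n TOTAL: $79200",
          "Testing $3880. Testing",
          "Testing -$69000Engine: $69000100%",
          "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")

prices <- extract_price(test)
print(prices)
```

Note that a string containing no $ at all is left unchanged by the second gsub, so as.numeric would return NA (with a warning) for it.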
Since you are learning regex, here are regex breakdowns:
Regex 1:
- ([$]|(?!^)\\G) - match and capture a “leading boundary” construct: either a $ symbol or the position right after the previous successful match, (?!^)\G (\G also matches the beginning of a string, and we eliminate that with the negative look-ahead (?!^))
- [\\s,]* - match 0 or more commas or whitespace characters
- (\\d) - match and capture a digit
With the \1\2 replacement pattern, we restore the $ symbol and the digits after it inside the string.
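To see what the first regex does on its own, here is a small sketch that runs only that step on two of the question's strings and inspects the intermediate result. The variable name step1 is just for illustration.

```r
test <- c("Testing $26,500\ntesting",
          "Testing testing original price : $ 8 2 9 5 . Real price is $ 7 4 9 5")

# Regex 1 only: each match is "$" (or the end of the previous match),
# then optional spaces/commas, then one digit. The spaces/commas are not
# captured, so the "\\1\\2" replacement drops them.
step1 <- gsub("([$]|(?!^)\\G)[\\s,]*(\\d)", "\\1\\2", test, perl = TRUE)

step1[1]  # "Testing $26500\ntesting"
step1[2]  # "Testing testing original price : $8295 . Real price is $7495"
```

The chain of \G matches stops as soon as a non-digit follows, which is why text outside the prices is left untouched.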
Regex 2:
- ^ - beginning of the string
- (?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+)) - a branch-reset group (?|...|...) where the capturing group index is reset to 1 in each alternative (so we only need the \1 reference in the replacement pattern to address the (\\d+) from either alternative), matching:
- [\\s\\S]*-[$](\\d+) - any zero or more characters ([\s\S]*) followed by a hyphen, then a $, and then 1 or more digits (\d+, Group 1)
- | - or
- [\\s\\S]*[$](\\d+) - any zero or more characters ([\s\S]*) followed by a $ and then 1 or more digits (\d+, still Group 1)
- [\\s\\S]*$ - any characters, 0 or more occurrences ([\s\S]*), up to the end of the string ($).
And we replace the whole match with just the \1 back-reference to get our results.
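The order of the two alternatives matters for the fourth string. As a rough illustration, the sketch below compares the full second regex with a variant that keeps only the second alternative; the variable names are hypothetical.

```r
s <- "Testing -$69000Engine: $69000100%"

# Full regex 2: the "-$" alternative is tried first, so the digits after "-$" win.
with_branch_reset <- gsub("^(?|[\\s\\S]*-[$](\\d+)|[\\s\\S]*[$](\\d+))[\\s\\S]*$",
                          "\\1", s, perl = TRUE)

# Second alternative only: the greedy [\s\S]* backtracks to the last "$".
last_dollar_only <- gsub("^[\\s\\S]*[$](\\d+)[\\s\\S]*$", "\\1", s, perl = TRUE)

with_branch_reset  # "69000"
last_dollar_only   # "69000100"
```

The second result is the same value the third attempt in the question produced for this string, which is why the answer adds the "-$" alternative in front.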