Why does this regex include a space in the last match?
I have the following text:
2 hcl + 12 na + 3 (na₃cl₂)₂₄ → 2 nacl + h₂
I want to match each molecule, including its coefficient. The regex below is working, but the space character right before the last match is getting matched, and it shouldn't be. Here's the regex I'm using:
(([0-9]* ??\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*))
If you look at this regex101 link, it might be easier to see what the problem is: https://regex101.com/r/hk7jy6/1
Update
If the strings are valid chemical formulae, why bother matching subscripts, digits and letters separately? They are all just non-whitespace symbols. Since there must be an obligatory letter or ( at the start of a molecule, put them in a character class, [a-z(], and then append \S* (zero or more non-whitespace characters):
/(?:\d+ )?[a-z(]\S*/gi
See the regex demo. The (?:...)? construct is an optional non-capturing group, i.e. a group that is used only for grouping, not for capturing (= storing a submatch inside a memory buffer).
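As a rough check of the simplified pattern, here is a minimal Python sketch. The names TEXT, SIMPLE and molecules are illustrative only, and re.IGNORECASE stands in for the i flag (findall already returns every match, so no g flag is needed). It assumes the input is the sample line from the question.

```python
import re

# Sample text from the question (lowercase, as given there).
TEXT = "2 hcl + 12 na + 3 (na₃cl₂)₂₄ → 2 nacl + h₂"

# Optional "digits + space" coefficient, then one letter or "(",
# then the rest of the non-whitespace run.
SIMPLE = re.compile(r"(?:\d+ )?[a-z(]\S*", re.IGNORECASE)

def molecules(text):
    # The pattern has no capturing groups, so findall returns whole matches.
    return SIMPLE.findall(text)

print(molecules(TEXT))
# ['2 hcl', '12 na', '3 (na₃cl₂)₂₄', '2 nacl', 'h₂']
```

The + and → separators are skipped because neither is in [a-z(], so a match can never start on them.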
Original answer: explanation of the root cause
You have the digits and the space at the beginning of the pattern as two separate optional subpatterns. Instead, you need to match them obligatorily and place them together inside one optional group:
(?:[0-9]+ )?\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*
See the regex demo.
Your [0-9]* ?? is turned into (?:[0-9]+ )?. Note that here you do not have to use the lazy version of the ? quantifier, as it will work the same way as the greedy one. I also removed the 2 unnecessary outer groups (...).
Since the (?:[0-9]+ )? group is optional as a whole, the space is matched only if there is a digit in front of it. If there is no digit, the next characters that can be matched are 0 or more ( symbols. Then, a [a-z] letter must be present (if there is no (, the letter is the first character in the match).
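To see the difference on the sample text, the two patterns can be compared side by side. This is only a sketch: ORIGINAL, FIXED and full_matches are made-up names, and Python's re module is assumed to behave like the regex101 demo for these constructs.

```python
import re

TEXT = "2 hcl + 12 na + 3 (na₃cl₂)₂₄ → 2 nacl + h₂"

# Pattern from the question: digits and space are optional independently.
ORIGINAL = re.compile(r"(([0-9]* ??\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*))")
# Pattern from the answer: the space is only allowed after at least one digit.
FIXED = re.compile(r"(?:[0-9]+ )?\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*")

def full_matches(pattern, text):
    # finditer + group(0) yields whole matches even with capturing groups.
    return [m.group(0) for m in pattern.finditer(text)]

original_matches = full_matches(ORIGINAL, TEXT)
fixed_matches = full_matches(FIXED, TEXT)

print(original_matches)
# ['2 hcl', '12 na', '3 (na₃cl₂)₂₄', '2 nacl', ' h₂']
print(fixed_matches)
# ['2 hcl', '12 na', '3 (na₃cl₂)₂₄', '2 nacl', 'h₂']
```

With the original pattern, the engine can start a match on the space before h₂: [0-9]* matches nothing, the lazy space quantifier is then forced to take the space, and the letter follows. The other spaces do not produce such a match only because they are followed by +, → or a digit rather than a letter or (.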
Let me break it down:
(?:[0-9]+ )? - an optional group: 1 or more digits followed by a space
\(* - 0 or more ( (maybe you meant ?)
([a-z]+[₀-₉]*)+ - 1 or more sequences of 1 or more letters followed by 0 or more subscript digits
\)* - 0 or more ) (maybe you meant ?)
[₀-₉]* - 0 or more subscript digits
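The same breakdown can be kept next to the pattern itself with Python's re.VERBOSE flag. This is a sketch under two assumptions that are not in the answer: the space is written as [ ] because verbose mode ignores bare whitespace, and the inner group is made non-capturing so that findall returns whole matches. ANNOTATED and TEXT are illustrative names.

```python
import re

TEXT = "2 hcl + 12 na + 3 (na₃cl₂)₂₄ → 2 nacl + h₂"

ANNOTATED = re.compile(
    r"""
    (?:[0-9]+[ ])?       # optional coefficient: 1+ digits, then a space
    \(*                  # 0 or more opening parentheses
    (?:[a-z]+[₀-₉]*)+    # 1+ runs of letters, each with optional subscripts
    \)*                  # 0 or more closing parentheses
    [₀-₉]*               # 0 or more trailing subscript digits
    """,
    re.VERBOSE,
)

print(ANNOTATED.findall(TEXT))
# ['2 hcl', '12 na', '3 (na₃cl₂)₂₄', '2 nacl', 'h₂']

# A leading space without a digit in front of it is rejected.
print(ANNOTATED.fullmatch(" h₂"))
# None
```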
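If you want to make sure you do not match unbalanced parentheses, as in (ca or h), you should split the \(*...\)* part into two alternatives, one without parentheses and one with both, like this:
(?:[0-9]+ )?(?:(?:[a-z]+[₀-₉]*)+|\((?:[a-z]+[₀-₉]*)+\))[₀-₉]*
See another demo.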
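A small Python sketch of what the stricter alternation changes. The input SAMPLE is invented here to contain unbalanced parentheses, and LOOSE, STRICT and full_matches are illustrative names. LOOSE is the earlier pattern with its inner group made non-capturing.

```python
import re

# Independent \(* and \)*: an opening or closing parenthesis may appear alone.
LOOSE = re.compile(r"(?:[0-9]+ )?\(*(?:[a-z]+[₀-₉]*)+\)*[₀-₉]*")
# Alternation: either no parentheses at all, or a matched ( ... ) pair.
STRICT = re.compile(
    r"(?:[0-9]+ )?(?:(?:[a-z]+[₀-₉]*)+|\((?:[a-z]+[₀-₉]*)+\))[₀-₉]*"
)

def full_matches(pattern, text):
    return [m.group(0) for m in pattern.finditer(text)]

# Hypothetical input: "(ca" and "h)" are unbalanced, the last term is fine.
SAMPLE = "3 (ca + h) + 2 (na₃cl₂)₂₄"

print(full_matches(LOOSE, SAMPLE))
# ['3 (ca', 'h)', '2 (na₃cl₂)₂₄']
print(full_matches(STRICT, SAMPLE))
# ['ca', 'h', '2 (na₃cl₂)₂₄']
```

Note that the strict pattern still finds the letters inside an unbalanced term; it only leaves the stray parenthesis out of the match, and with it the coefficient that stood in front of the (.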