python - Function re.sub() refuses to work when I change an ANSI string to a Unicode one
When I use ANSI characters, it works as expected:
>>> import re
>>> r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
>>> s = 'what is it?'
>>> re.sub(r, ur'\1<br>\2<br>\3<br>', s, re.UNICODE)
u'what<br>is<br>it<br>'
But when I change the string s to a similar one that contains Unicode characters, it doesn't work the way I want:
>>> s = u'что это есть?'
>>> re.sub(r, ur'\1<br>\2<br>\3<br>', s, re.UNICODE)
u'\u0447\u0442\u043e \u044d\u0442\u043e \u0435\u0441\u0442\u044c?'
This looks strange (the string stays unchanged), because I use re.UNICODE in both cases. Yet re.match does match the groups with the UNICODE flag:
>>> m = re.match(r, s, re.UNICODE)
>>> m.group(1)
u'\u0447\u0442\u043e'
>>> m.group(2)
u'\u044d\u0442\u043e'
>>> m.group(3)
u'\u0435\u0441\u0442\u044c'
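You have to pass re.UNICODE as the flags parameter:
re.sub(r, ur'\1<br>\2<br>\3<br>', s, flags = re.UNICODE)
Otherwise Python treats the fourth positional argument as count, as specified in the re documentation. re.match worked in the question because its third positional parameter is flags.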
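To see the count/flags mix-up in isolation, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default, so re.UNICODE alone no longer changes the result; re.IGNORECASE stands in for it here. The sample string and variable names are made up for illustration.

```python
import re
import warnings

# Flags are integer-valued, so they are accepted silently as a count.
print(int(re.UNICODE))     # 32
print(int(re.IGNORECASE))  # 2

with warnings.catch_warnings():
    # Newer Python 3 releases warn about a positional count; silence that for the demo.
    warnings.simplefilter("ignore")
    # The flag lands in the count slot: count=2, flags=0, so no case folding happens.
    positional = re.sub(r'a', 'b', 'AAAA aaaa', re.IGNORECASE)

# Passed by keyword, the flag is really used as a flag.
keyword = re.sub(r'a', 'b', 'AAAA aaaa', flags=re.IGNORECASE)

print(positional)  # AAAA bbaa
print(keyword)     # bbbb bbbb
```

The positional call replaces only the first two lowercase matches, which is the same kind of silent misbehaviour as in the question: there, the call effectively ran with count=32 and no flags at all.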
Full example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
#s = 'what is it?'
s = u'что это есть?'
# flags must be passed by keyword; the 4th positional argument is count
print re.sub(r, ur'\1<br>\2<br>\3<br>', s, flags = re.UNICODE).encode('utf-8')
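For anyone on Python 3, a rough equivalent of the script above might look like this. It is a sketch that assumes Python 3 semantics: the ur'' prefix no longer exists, str is already Unicode, and re.UNICODE is redundant for str patterns but harmless. The names pattern, repl, result and compiled_result are mine, not from the original post.

```python
import re

pattern = r'(\w+)\s+(\w+)\s+(\w+)\?'
repl = r'\1<br>\2<br>\3<br>'
s = 'что это есть?'

# Flags passed by keyword, so nothing can be mistaken for count.
result = re.sub(pattern, repl, s, flags=re.UNICODE)
print(result)

# Alternative: compile first; the flags then belong to the pattern object
# and the sub() call has no flags argument to get wrong.
compiled_result = re.compile(pattern, re.UNICODE).sub(repl, s)
print(compiled_result)
```

Compiling the pattern first is a convenient way to avoid the argument-order trap altogether, since the flags are attached once at compile time.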