python - Function re.sub() refuses to work when I change an ANSI string to a Unicode one

When I use ANSI characters, it works as expected:

>>> import re
>>> r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
>>> s = 'what is it?'
>>> re.sub(r, ur'\1<br>\2<br>\3<br>', s, re.UNICODE)
u'what<br>is<br>it<br>'

But when I change the string s to a similar one that contains Unicode characters, it doesn't work the way I want:

>>> s = u'что это есть?'
>>> re.sub(r, ur'\1<br>\2<br>\3<br>', s, re.UNICODE)
u'\u0447\u0442\u043e \u044d\u0442\u043e \u0435\u0441\u0442\u044c?'

This looks strange (the string stays unchanged), because I use re.UNICODE in both cases. re.match does match the groups with the same flag:

>>> m = re.match(r, s, re.UNICODE)
>>> m.group(1)
u'\u0447\u0442\u043e'
>>> m.group(2)
u'\u044d\u0442\u043e'
>>> m.group(3)
u'\u0435\u0441\u0442\u044c'

You have to pass re.UNICODE through the flags parameter:

re.sub(r, ur'\1<br>\2<br>\3<br>', s, flags = re.UNICODE)

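Otherwise Python correctly assumes that the 4th parameter is count, as specified in the re documentation. re.UNICODE is just the integer 32, so the call silently means "replace at most 32 matches, with no flags". The ASCII string still worked because \w matches ASCII letters without any flag; re.match worked because its third parameter really is flags.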
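The same pitfall can be sketched in Python 3. This is an illustration, not the asker's code: in Python 3, str patterns are Unicode-aware by default, so re.IGNORECASE stands in for re.UNICODE here, and the sample string 'AAA' and the variable names are assumptions. The flag is passed as count by keyword, which is what the positional 4th argument amounts to.

```python
import re

text = 'AAA'

# Flags are integer-valued, so nothing stops one from being used as a count.
flag_value = int(re.IGNORECASE)

# What re.sub('a', 'x', text, re.IGNORECASE) amounts to: the flag lands in
# `count`, flags stays 0, and the lowercase pattern never matches.
flag_taken_as_count = re.sub('a', 'x', text, count=re.IGNORECASE)

# Passing the flag by keyword applies it as intended.
flag_by_keyword = re.sub('a', 'x', text, flags=re.IGNORECASE)

print(flag_value)
print(flag_taken_as_count)
print(flag_by_keyword)
```

Neither call raises an error, which is why the mistake is easy to miss: the only symptom is that the string comes back unchanged.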

Full example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
#s = 'what is it?'
s = u'что это есть?'
# flags must be passed by keyword; the 4th positional argument is count
print re.sub(r, ur'\1<br>\2<br>\3<br>', s, flags = re.UNICODE).encode('utf-8')
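For readers on Python 3, a rough equivalent of the example might look like the sketch below. It assumes the same pattern and input; the ur'' prefix and the print statement do not exist in Python 3, and re.UNICODE is already the default for str patterns, so the flag is kept only to mirror the original call. The names pattern and result are chosen for illustration.

```python
import re

pattern = r'(\w+)\s+(\w+)\s+(\w+)\?'
s = 'что это есть?'

# flags is passed by keyword so it cannot be mistaken for count
result = re.sub(pattern, r'\1<br>\2<br>\3<br>', s, flags=re.UNICODE)
print(result)
```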
