python - Function re.sub() refuses to work when I change an ANSI string to UNICODE one


When I use ANSI characters, it works as expected:

>>> import re
>>> r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
>>> s = 'what is it?'
>>> re.sub(r, ur'\1<br>\2<br>\3<br>', s, re.UNICODE)
u'what<br>is<br>it<br>'

But when I change the string s to a similar one that consists of Unicode characters, it doesn't work the way I want:

>>> s = u'что это есть?'
>>> re.sub(r, ur'\1<br>\2<br>\3<br>', s, re.UNICODE)
u'\u0447\u0442\u043e \u044d\u0442\u043e \u0435\u0441\u0442\u044c?'

It looks strange (the string stays unchanged), because I use re.UNICODE in both cases. Yet re.match correctly matches the groups with the UNICODE flag:

>>> m = re.match(r, s, re.UNICODE)
>>> m.group(1)
u'\u0447\u0442\u043e'
>>> m.group(2)
u'\u044d\u0442\u043e'
>>> m.group(3)
u'\u0435\u0441\u0442\u044c'

You have to pass re.UNICODE as the flags parameter, by keyword:

re.sub(r, ur'\1<br>\2<br>\3<br>', s, flags=re.UNICODE)

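Otherwise Python correctly assumes that the 4th positional parameter is count, as specified in the re documentation. The flag is then never applied, so \w does not match the Cyrillic letters and the string comes back unchanged. re.match worked because its 3rd positional parameter really is flags.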
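The positional mix-up can be reproduced with any flag. The following is a minimal sketch in Python 3 syntax (an assumption, since the post itself uses Python 2), with re.IGNORECASE standing in for re.UNICODE because Python 3 str patterns already match Unicode word characters by default. The variable names are illustrative only.

```python
import re
import warnings

text = "aaaa"

with warnings.catch_warnings():
    # Newer Python 3 releases warn when count is passed positionally.
    warnings.simplefilter("ignore", DeprecationWarning)

    # The flag lands in the count slot, so no flag is applied:
    # "A" does not match lowercase text and nothing is replaced.
    unchanged = re.sub(r"A", "-", text, re.IGNORECASE)

    # Same call with a pattern that does match: the flag's integer
    # value (2 for re.IGNORECASE) acts as the replacement limit.
    limited = re.sub(r"a", "-", text, re.IGNORECASE)

# Passed by keyword, the flag is applied and every match is replaced.
fixed = re.sub(r"A", "-", text, flags=re.IGNORECASE)

print(unchanged, limited, fixed)
```

The first call mirrors the behaviour in the question: the string is returned as-is because the flag was silently treated as a count.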

Full example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

r = ur'(\w+)\s+(\w+)\s+(\w+)\?'
#s = 'what is it?'
s = u'что это есть?'
# Pass the flag by keyword so it is not taken as count.
print re.sub(r, ur'\1<br>\2<br>\3<br>', s, flags=re.UNICODE).encode('utf-8')
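Another way to avoid the count/flags confusion is to compile the pattern first, since re.compile takes the flags as its second argument and the compiled object's sub() has no flags parameter at all. This is a sketch in Python 3 syntax (an assumption; under Python 2 the literals would keep their u/ur prefixes), and the names pattern and result are illustrative.

```python
import re

# Flags are bound to the pattern at compile time.
# (In Python 3 re.UNICODE is already the default for str patterns.)
pattern = re.compile(r"(\w+)\s+(\w+)\s+(\w+)\?", re.UNICODE)

s = "что это есть?"

# Pattern.sub(repl, string, count=0): there is no flags slot to get wrong.
result = pattern.sub(r"\1<br>\2<br>\3<br>", s)
print(result)
```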
