regex - How to convert some character into five digit unicode one in Python 3.3?


asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'd like to convert some character into a five-digit Unicode one in Python 3.3. For example,

import re
print(re.sub('a', u'\u1D15D', 'abc'))

but the result is different from what I expected. Do I have to use the character itself rather than its code point? Is there a better way to handle five-digit Unicode characters?



1 Answer

深蓝 · Answer 1 · 2021-10-23T21:27:33+0000

Python Unicode escapes are either 4 hex digits (\uabcd) or 8 (\Uabcdabcd). For a code point beyond U+FFFF you need the latter (with a capital U), left-padded with zeros to eight digits:

>>> '\U0001D15D'
'𝅝'
>>> '\U0001D15D'.encode('unicode_escape')
b'\\U0001d15d'

(And yes, the U+1D15D code point (MUSICAL SYMBOL WHOLE NOTE) is in the above example, but your browser font may not be able to render it, showing a placeholder glyph (a box or question mark) instead.)
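The difference between the two escape lengths can be checked directly. The following is a minimal sketch, assuming Python 3.3 or later (where every code point counts as one character); the variable names `short` and `full` are only illustrative:

```python
import unicodedata

short = '\u1D15D'    # parsed as '\u1D15' followed by the ASCII letter 'D'
full = '\U0001D15D'  # a single code point, U+1D15D

print(len(short), [hex(ord(c)) for c in short])  # 2 ['0x1d15', '0x44']
print(len(full), hex(ord(full)))                 # 1 0x1d15d
print(unicodedata.name(full))                    # MUSICAL SYMBOL WHOLE NOTE

# chr() builds the same character from the integer code point,
# which avoids having to count zero padding in the escape.
print(full == chr(0x1D15D))                      # True
```

Checking `len()` and `ord()` like this sidesteps the font problem, since it does not depend on whether the glyph can be rendered.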

Because you used a \uabcd escape, you replaced a in abc with two characters: the code point U+1D15 (ᴕ, LATIN LETTER SMALL CAPITAL OU) and the ASCII character D. Using a 32-bit Unicode literal works:

>>> import re
>>> print(re.sub('a', '\U0001D15D', 'abc'))
𝅝bc
>>> print(re.sub('a', u'\U0001D15D', 'abc').encode('unicode_escape'))
b'\\U0001d15dbc'

where again the U+1D15D codepoint could be displayed by your font as a placeholder glyph instead.
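If the code point is only known as a number, the replacement can also be built with `chr()` instead of an escape. This is a sketch under that assumption and not part of the original answer; the name `WHOLE_NOTE` is hypothetical:

```python
import re

WHOLE_NOTE = chr(0x1D15D)  # same character as '\U0001D15D'

# A function replacement is inserted as-is, so the returned string
# is not processed as a replacement template.
result = re.sub('a', lambda m: WHOLE_NOTE, 'abc')
print(result.encode('unicode_escape'))  # b'\\U0001d15dbc'

# For a fixed single-character substitution, str.replace gives the
# same result without a regular expression.
print('abc'.replace('a', WHOLE_NOTE) == result)  # True
```

Printing the `unicode_escape` encoding, as in the answer, confirms the result even when the terminal font cannot display the character.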
