Last modified: Nov 08, 2024 By Alexander Williams

Python Unicode Regex: Handle International Text Like a Pro

Working with international text in Python requires understanding Unicode regex patterns. This guide will show you how to effectively handle non-ASCII characters in your regular expressions.

Understanding Unicode in Python Regex

Python's re module fully supports Unicode matching. When working with international text, you'll need to use Unicode categories and properties to match specific character types.

Basic Unicode Categories

Unicode categories help you match broad groups of characters. Here's a simple example using \w to match Unicode word characters:


import re

text = "Hello مرحبا 你好 привет"
pattern = r'\w+'
matches = re.findall(pattern, text)
print(matches)


['Hello', 'مرحبا', '你好', 'привет']

Using Unicode Properties

For more specific matching, use Unicode properties with \p{} syntax. First, make sure to use the re.UNICODE flag or the 'u' prefix in your pattern:


text = "Mixed text с кириллицей"
pattern = re.compile(r'[\u0400-\u04FF]+')  # Cyrillic range
matches = pattern.findall(text)
print(matches)

Named Unicode Blocks

You can use named blocks to match specific scripts. This is particularly useful when working with multiple pattern matches:


text = "안녕하세요 Hello こんにちは"
hangul = re.compile(r'[\uAC00-\uD7AF]+')
japanese = re.compile(r'[\u3040-\u309F\u30A0-\u30FF]+')

print("Korean:", hangul.findall(text))
print("Japanese:", japanese.findall(text))

Case-Insensitive Unicode Matching

When using case-insensitive matching with Unicode, always use the re.IGNORECASE flag:


text = "CAFÉ café"
pattern = re.compile(r'café', re.IGNORECASE)
matches = pattern.findall(text)
print(matches)

Common Unicode Categories

Here are some useful Unicode categories for international text processing:

\p{L} - Any kind of letter from any language
\p{N} - Any kind of numeric character
\p{P} - Any kind of punctuation character

Handling Combining Characters

When working with accented characters, you might need to handle combining characters separately. Use re.compile with appropriate flags:


text = "e\u0301"  # é composed of 'e' and combining acute accent
pattern = re.compile(r'e\u0301|é')
print(bool(pattern.match(text)))

Performance Considerations

Unicode regex operations can be slower than ASCII-only patterns. For better performance, compile your patterns when using them repeatedly and use specific character ranges when possible.

Error Handling

Always ensure your source files are properly encoded. Use the encoding declaration at the start of your Python files when working with Unicode literals:


# -*- coding: utf-8 -*-
import re

text = "Hello världen"  # Swedish
pattern = re.compile(r'\w+', re.UNICODE)

Conclusion

Unicode regex in Python provides powerful tools for handling international text. By understanding these patterns and using appropriate flags, you can effectively process text in any language.

Remember to always use the proper encoding and Unicode properties for reliable pattern matching across different character sets and languages.