Perl Unicode Cookbook: Extract by Grapheme Instead of Codepoint (regex)

℞ 30: Extract by grapheme instead of by codepoint (regex)

Remember that Unicode defines a grapheme as “what a user thinks of as a character”. A codepoint is an integer value in the Unicode codespace. A single grapheme may consist of several codepoints, such as a base letter followed by one or more combining marks. While ASCII conflates the two, effective Unicode use respects the difference between user-visible characters and their representations.

Use the \X regex metacharacter when you need to extract graphemes from a string instead of codepoints:

 # match and grab the first five graphemes
 my ($first_five) = $str =~ /^ ( \X{5} ) /x;
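
To see why this matters, here is a minimal sketch comparing \X against the codepoint-oriented . metacharacter. The sample string and the variable names $by_grapheme, $by_codepoint, and $count are illustrative assumptions, not part of the recipe; the string spells "résumé" with combining acute accents (U+0301), so it holds 8 codepoints but only 6 graphemes.

```perl
use strict;
use warnings;

# Hypothetical sample: "résumé" written with combining acute accents,
# giving 8 codepoints but only 6 user-visible characters.
my $str = "re\x{301}sume\x{301}";

# \X matches one extended grapheme cluster; . matches one codepoint
my ($by_grapheme)  = $str =~ /^ ( \X{5} ) /x;
my ($by_codepoint) = $str =~ /^ ( .{5}  ) /xs;

# length() counts codepoints, so the two captures differ in size
printf "by grapheme:  %d codepoints\n", length $by_grapheme;    # 6
printf "by codepoint: %d codepoints\n", length $by_codepoint;   # 5

# Count graphemes by matching \X repeatedly in list context
my $count = () = $by_codepoint =~ /\X/g;
printf "the codepoint capture holds %d graphemes\n", $count;    # 4
```

The \X{5} capture keeps each accent attached to its base letter, while .{5} spends one of its five slots on the combining mark and so returns only four user-visible characters.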

Previous: ℞ 29: Match Unicode Grapheme Cluster in Regex

Series Index: The Standard Preamble

Next: ℞ 31: Extract by Grapheme Instead of Codepoint (substr)
