Perl Unicode Cookbook: The Standard Preamble
Editor’s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. This is the first recipe in the series.
℞ 0: Standard preamble
Unless otherwise noted, all examples in this cookbook require this standard preamble to work correctly, with the #! line adjusted to work on your system:
#!/usr/bin/env perl
use utf8; # so literals and identifiers can be in UTF-8
use v5.12; # or later to get "unicode_strings" feature
use strict; # quote strings, declare variables
use warnings; # on by default
use warnings qw(FATAL utf8); # fatalize encoding glitches
use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
use charnames qw(:full :short); # unneeded in v5.16
This does force even Unix programmers to binmode their binary streams, or open them with :raw, but that’s the only way to get at them portably anyway.
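As a rough illustration of that point, the sketch below writes and reads a few bytes that are not valid UTF-8 while the preamble's default UTF-8 layer is in effect. The helper names write_bytes and read_bytes, the payload, and the File::Temp scratch file are assumptions made for this example; they are not part of the original recipe.

```perl
#!/usr/bin/env perl
use utf8;
use v5.12;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open qw(:std :encoding(UTF-8));   # default layer for undeclared streams

use File::Temp qw(tempfile);          # core module, used only for a scratch file

# Hypothetical helper: an explicit :raw layer in open() overrides the
# default :encoding(UTF-8) layer, so the bytes are written untouched.
sub write_bytes {
    my ($path, $bytes) = @_;
    open(my $out, '>:raw', $path) or die "Cannot write $path: $!";
    print {$out} $bytes           or die "Cannot print to $path: $!";
    close($out)                   or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: open with the default layer, then call binmode()
# before the first read to switch the handle to binary mode.
sub read_bytes {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    binmode($in);
    my $bytes = do { local $/; <$in> };   # slurp the whole file
    close($in) or die "Cannot close $path: $!";
    return $bytes;
}

my (undef, $path) = tempfile(UNLINK => 1);
my $payload = "\xFF\xFE\x00\x80";         # four bytes, not valid UTF-8

write_bytes($path, $payload);
my $back = read_bytes($path);

say $back eq $payload ? "bytes survived the round trip" : "bytes were altered";
```

Either form works for a given handle; putting :raw in the open call keeps the intent visible at the point where the stream is created.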
WARNING: use autodie and use open do not get along with each other.
This combination of features sets Perl to a known state of Unicode compatibility and strictness, so that subsequent operations behave as you expect.
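To make that known state a little more concrete, here is a minimal sketch of a script running under the preamble. The helpers write_text and read_text, the sample strings, and the File::Temp scratch file are assumptions made for illustration. In light of the warning above, the sketch checks open, print, and close by hand rather than loading autodie.

```perl
#!/usr/bin/env perl
use utf8;                             # source code is UTF-8
use v5.12;                            # enables say and unicode_strings
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open qw(:std :encoding(UTF-8));
use charnames qw(:full :short);

use File::Temp qw(tempfile);          # core module, used only for a scratch file

# Hypothetical helper: no layer is named, so "use open" supplies
# :encoding(UTF-8) and the text is encoded on the way out.
sub write_text {
    my ($path, $text) = @_;
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $text        or die "Cannot print to $path: $!";
    close($out)               or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: the same default layer decodes the text on the way in.
sub read_text {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close($in) or die "Cannot close $path: $!";
    return $text;
}

my $chars = length "日本語";          # "use utf8": 3 characters, not 9 bytes
my $upper = uc "\xE9";                # unicode_strings: U+00E9 upcases to U+00C9
my $alpha = "\N{GREEK SMALL LETTER ALPHA}";   # full character name

my $line = "caf\N{LATIN SMALL LETTER E WITH ACUTE} $alpha\n";
my (undef, $path) = tempfile(UNLINK => 1);
write_text($path, $line);
my $bytes_on_disk = -s $path;         # larger than length($line): UTF-8 on disk
my $round_trip    = read_text($path);

say "literal length: $chars";
say sprintf("uc of U+00E9: U+%04X", ord $upper);
say sprintf("alpha: U+%04X", ord $alpha);
say "characters: ", length($round_trip), ", bytes on disk: $bytes_on_disk";
```

The character count and the on-disk byte count differ because the default layer encodes non-ASCII characters as multi-byte UTF-8 sequences when writing and decodes them again when reading.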
The recipes in this cookbook are:
- ℞ 0: The Standard Preamble
- ℞ 1: Always Decompose and Recompose
- ℞ 2: Fine-Tuning Unicode Warnings
- ℞ 3: Enable UTF-8 Literals
- ℞ 4: Characters and Their Numbers
- ℞ 5: Unicode Literals by Number
- ℞ 6: Get Character Names by Number
- ℞ 7: Get Character Number by Name
- ℞ 8: Unicode Named Characters
- ℞ 9: Unicode Named Character Sequences
- ℞ 10: Custom Named Characters
- ℞ 11: Names of CJK Codepoints
- ℞ 12: Explicit encode/decode
- ℞ 13: Decode @ARGV as UTF-8
- ℞ 14: Decode @ARGV as Local Encoding
- ℞ 15: Decode Standard Filehandles as UTF-8
- ℞ 16: Decode Standard Filehandles as Locale Encoding
- ℞ 17: Make File I/O Default to UTF-8
- ℞ 18: Make All I/O Default to UTF-8
- ℞ 19: Specify a File’s Encoding
- ℞ 20: Unicode Casing
- ℞ 21: Case-insensitive Comparisons
- ℞ 22: Match Unicode Linebreak Sequence
- ℞ 23: Get Character Categories
- ℞ 24: Disable Unicode-awareness in Builtin Character Classes
- ℞ 25: Match Unicode Properties in Regex
- ℞ 26: Custom Character Properties
- ℞ 27: Unicode Normalization
- ℞ 28: Convert non-ASCII Unicode Numerics
- ℞ 29: Match Unicode Grapheme Cluster in Regex
- ℞ 30: Extract by Grapheme Instead of Codepoint (regex)
- ℞ 31: Extract by Grapheme Instead of Codepoint (substr)
- ℞ 32: Reverse String by Grapheme
- ℞ 33: String Length in Graphemes
- ℞ 34: Unicode Column Width for Printing
- ℞ 35: Unicode Collation
- ℞ 36: Case- and Accent-insensitive Sorting
- ℞ 37: Unicode Locale Collation
- ℞ 38: Make cmp Work on Text instead of Codepoints
- ℞ 39: Case- and Accent-insensitive Comparison
- ℞ 40: Case- and Accent-insensitive Locale Comparisons
- ℞ 41: Unicode Linebreaking
- ℞ 42: Unicode Text in Stubborn Libraries
- ℞ 43: Unicode Text in DBM Files (the easy way)
- ℞ 44: Demo of Unicode Collation and Printing
- ℞ 45: Further Resources