Perl Unicode Cookbook: Make cmp Work on Text instead of Codepoints

Jun 7, 2012 by Tom Christiansen

℞ 38: Making `cmp` work on text instead of codepoints

Even with Perl 5.12’s `unicode_strings` feature, some of Perl’s core operations do not perform as expected on Unicode strings by default. For example, how is the `cmp` operator to know whether its arguments are octets, larger codepoints, or graphemes, or whether a specific collation should be in effect?
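To make the problem concrete, here is a minimal sketch using made-up sample words (they are not part of the original recipe). Plain `cmp` compares strings codepoint by codepoint, so a word beginning with É (U+00C9) lands after every word that begins with an unaccented ASCII letter:

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# Hypothetical sample data, for illustration only.
my @words = ( "zebra", "apple", "Éclair", "banana" );

# cmp compares by codepoint, so U+00C9 (É) sorts after "z" (U+007A).
my @by_codepoint = sort { $a cmp $b } @words;

binmode( STDOUT, ':encoding(UTF-8)' );
print join( ", ", @by_codepoint ), "\n";
# prints: apple, banana, zebra, Éclair
```

A reader of a sorted list would normally expect "Éclair" to appear between "banana" and "zebra", which is what a collation-aware comparison provides.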

Where you might write:

 @srecs = sort {
     $b->{AGE}   <=>  $a->{AGE}
                 ||
     $a->{NAME}  cmp  $b->{NAME}
 } @recs;

… a Unicode-aware comparison should instead use Unicode::Collate:

 use Unicode::Collate;

 my $coll = Unicode::Collate->new();
 # Compute each record's sort key once, before sorting.
 for my $rec (@recs) {
     $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
 }
 @srecs = sort {
     $b->{AGE}       <=>  $a->{AGE}
                     ||
     $a->{NAME_key}  cmp  $b->{NAME_key}
 } @recs;

This module’s getSortKey() method returns a binary sort key that encodes a given Unicode string’s position in the collator’s collation order (at its configured collation level). Because these keys are designed to be compared byte by byte, `cmp` orders them correctly.
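As a self-contained sketch of the same technique, the following uses hypothetical records (the names and ages are invented) and keeps the sort keys in a separate hash, `%key_for`, rather than in the records themselves. It assumes the default collation, with no locale-specific tailoring:

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals
use Unicode::Collate;

# Hypothetical records, for illustration only.
my @recs = (
    { NAME => "zoë",   AGE => 30 },
    { NAME => "Émile", AGE => 30 },
    { NAME => "adam",  AGE => 30 },
    { NAME => "Bob",   AGE => 45 },
);

my $coll = Unicode::Collate->new();

# Compute each sort key once, so the sort block only compares byte strings.
my %key_for = map { $_->{NAME} => $coll->getSortKey( $_->{NAME} ) } @recs;

# Oldest first; within the same age, names in collation order.
my @srecs = sort {
    $b->{AGE} <=> $a->{AGE}
              ||
    $key_for{ $a->{NAME} } cmp $key_for{ $b->{NAME} }
} @recs;

binmode( STDOUT, ':encoding(UTF-8)' );
print "$_->{AGE} $_->{NAME}\n" for @srecs;
# prints:
# 45 Bob
# 30 adam
# 30 Émile
# 30 zoë
```

For a one-off sort of plain strings, `$coll->sort(@names)` or `$coll->cmp($x, $y)` avoids the key bookkeeping. Precomputed keys pay off when each string takes part in many comparisons, as it does in a large sort.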

Previous: ℞ 37: Unicode Locale Collation

Series Index: The Standard Preamble

Next: ℞ 39: Case- and Accent-insensitive Comparison
