utf8 results from the CPAN

utf8
AnyEvent
view release on metacpan or search on metacpan
Croaks when it cannot decode the string.

=cut

sub punycode_encode($) {
   require "AnyEvent/Util/idna.pl";
   goto &punycode_encode;
}

sub punycode_decode($) {
   require "AnyEvent/Util/idna.pl";
   goto &punycode_decode;
}

=item AnyEvent::Util::idn_nameprep $idn[, $display]

Implements the IDNA nameprep normalisation algorithm. Or actually the
UTS#46 algorithm. Or maybe something similar - reality is complicated
between IDNA2003, UTS#46 and IDNA2008. If C<$display> is true then the name
is prepared for display, otherwise it is prepared for lookup (default).

If you have no clue what this means, look at C<idn_to_ascii> instead.

This function is designed to avoid using a lot of resources - it uses
about 1MB of RAM (most of this due to Unicode::Normalize). Also, names
that are already "simple" will only be checked for basic validity, without
the overhead of full nameprep processing.

=cut

our ($uts46_valid, $uts46_imap);

sub idn_nameprep($;$) {
   local $_ = $_[0];

   # lowercasing these should always be valid, and is required for xn-- detection
   y/A-Z/a-z/;

   if (/[^0-9a-z\-.]/) {
      # load the mapping data
      unless (defined $uts46_imap) {
         require Unicode::Normalize;
         require "AnyEvent/Util/uts46data.pl";
      }

      # uts46 nameprep

      # I naively tried to use a regex/transliterate approach first,
      # with one regex and one y///, but the compiled code was 4.5MB.
      # this version has a bit-table for the valid class, and
      # a char-replacement search string

      # for speed (cough) reasons, we skip-case 0-9a-z, -, ., which
      # really ought to be trivially valid. A-Z is valid, but already lowercased.
      s{
         ([^0-9a-z\-.])
      }{
         my $chr = $1;
         unless (vec $uts46_valid, ord $chr, 1) {
            # not in valid class, search for mapping
            utf8::encode $chr; # the imap table is in utf-8
            (my $rep = index $uts46_imap, "\x00$chr") >= 0
               or Carp::croak "$_[0]: disallowed characters (U+" . (unpack "H*", $chr) . ") during idn_nameprep";

            (substr $uts46_imap, $rep, 128) =~ /\x00 .[\x80-\xbf]* ([^\x00]*) \x00/x
               or die "FATAL: idn_nameprep imap table has unexpected contents";

            $rep = $1;
            $chr = $rep unless $rep =~ s/^\x01// && $_[1]; # replace unless deviation and display
            utf8::decode $chr;
         }
         $chr
      }gex;

      # KC
      $_ = Unicode::Normalize::NFKC ($_);
   }

   # decode punycode components, check for invalid xx-- prefixes
   s{
      (^|\.)(..)--([^\.]*)
   }{
      my ($pfx, $ace, $pc) = ($1, $2, $3);

      if ($ace eq "xn") {
         $pc = punycode_decode $pc; # will croak on error (we hope :)

         require Unicode::Normalize;
         $pc eq Unicode::Normalize::NFC ($pc)
            or Carp::croak "$_[0]: punycode label not in NFC detected during idn_nameprep";

         "$pfx$pc"
      } elsif ($ace !~ /^[a-z0-9]{2}$/) {
         "$pfx$ace--$pc"
      } else {
         Carp::croak "$_[0]: hyphens in 3rd/4th position of a label are not allowed";
      }
   }gex;

   # uts46 verification
   /\.-|-\./
      and Carp::croak "$_[0]: invalid hyphens detected during idn_nameprep";

   # missing: label begin with combining mark, idna2008 bidi

   # now check validity of each codepoint
   if (/[^0-9a-z\-.]/) {
      # load the mapping data
      unless (defined $uts46_imap) {
         require "AnyEvent/Util/uts46data.pl";
      }

      vec $uts46_valid, ord, 1
         or $_[1] && 0 <= index $uts46_imap, pack "C0U*", 0, ord, 1 # deviation == \x00$chr\x01
         or Carp::croak "$_[0]: disallowed characters during idn_nameprep"
         for split //;
   }

   $_
}

=item $domainname = AnyEvent::Util::idn_to_ascii $idn

Converts the given unicode string (C<$idn>, international domain name,
e.g. æ—¥æœ¬èªžã€‚ï¼ªï¼°) to a pure-ASCII domain name (this is usually
called the "IDN ToAscii" transform). This transformation is idempotent,
which means you can call it just in case and it will do the right thing.

Unlike some other "ToAscii" implementations, this one works on full domain
names and should never fail - if it cannot convert the name, then it will
( run in 1.295 second using v1.01-cache-2.11-cpan-995e09ba956 )