Big5


  Example: the Japanese katakana "SO", shown here as [ `/ ], is encoded as "\x83\x5C" in SJIS
 
                  see     hex dump
  -----------------------------------------
  source script   "`/"    [83 5c]
  -----------------------------------------
 
  Here, use SJIS;
                          hex dump
  -----------------------------------------
  escaped script  "`\/"   [83 [5c] 5c]
  -----------------------------------------
                    ^--- escaped by SJIS software
 
  by the by       see     hex dump
  -----------------------------------------
  your eye's      "`/\"   [83 5c] [5c]
  -----------------------------------------
  perl eye's      "`\/"   [83] \[5c]
  -----------------------------------------
 
                          hex dump
  -----------------------------------------
  in the perl     "`/"    [83] [5c]
  -----------------------------------------

=head1 Multiple-Octet Anchoring of Regular Expression (Big5 software provides)

Big5 software adds a multiple-octet anchor to the beginning of each regular
expression, so that matching starts only on a character boundary.

  --------------------------------------------------------------------------------
  Before                  After
  --------------------------------------------------------------------------------
  m/regexp/               m/${Ebig5::anchor}(?:regexp).../
  --------------------------------------------------------------------------------
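
The effect of the anchor can be sketched without Big5 software installed. The
pattern below is copied from the ${Ebig5::anchor} definition listed in this
document; the helper names naive_match() and anchored_match() are hypothetical
and exist only for this illustration.

```perl
use strict;
use warnings;

# Anchor copied from the ${Ebig5::anchor} definition: it skips whole
# characters (one octet, or a lead octet 0x81-0xFE plus one more octet).
my $anchor = qr{\G(?>[^\x81-\xFE]|[\x81-\xFE][\x00-\xFF])*?};

# Hypothetical helpers, not part of Big5 software.
sub naive_match {
    my ($text, $re) = @_;
    return $text =~ /$re/ ? 1 : 0;
}

sub anchored_match {
    my ($text, $re) = @_;
    return $text =~ /${anchor}(?:$re)/ ? 1 : 0;
}

my $text = "\xA4\x41";   # one double-octet character; its second octet is 0x41 ('A')

print naive_match($text, 'A'), "\n";             # 1: octet 0x41 found inside the character
print anchored_match($text, 'A'), "\n";          # 0: no character of $text is 'A'
print anchored_match("\xA4\x41A", 'A'), "\n";    # 1: a real 'A' follows the character
```

Because \G pins the match to the start position and the atomic group consumes
whole characters, the pattern can never begin in the middle of a double-octet
character.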

=head1 Escaping Second Octet (Big5 software provides)

Big5 software escapes the second octet of a multiple-octet character in a
regular expression.

  --------------------------------------------------------------------------------
  Before                  After
  --------------------------------------------------------------------------------
  m<...`/...>             m<...`/\...>
  --------------------------------------------------------------------------------
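
A minimal sketch of why this matters, assuming a pattern held in a string at
run time (Big5 software itself rewrites the script source). The helper
escape_second_octet() is hypothetical. The octets "\xA5\x5C" are used as an
example of a double-octet character whose second octet is 0x5C.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the pattern one character at a time and, when a
# double-octet character ends in 0x5C, append a second 0x5C after it.
sub escape_second_octet {
    my ($pattern) = @_;
    my $out = '';
    while ($pattern =~ /\G([\x81-\xFE][\x00-\xFF]|[\x00-\xFF])/g) {
        my $char = $1;
        $char .= "\x5C" if length($char) == 2 && substr($char, 1, 1) eq "\x5C";
        $out .= $char;
    }
    return $out;
}

my $text    = "\xA5\x5Cn";   # double-octet character (second octet 0x5C), then 'n'
my $pattern = "\xA5\x5Cn";   # the same octets used as a regular expression

# Unescaped, perl reads 0x5C followed by 'n' as the escape \n, so the match fails.
print(($text =~ /$pattern/) ? "match\n" : "no match\n");    # no match

my $escaped = escape_second_octet($pattern);                # "\xA5\x5C\x5Cn"
print(($text =~ /$escaped/) ? "match\n" : "no match\n");    # match
```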

=head1 Multiple-Octet Character Regular Expression (Big5 software provides)

Big5 software clusters a multiple-octet character that is followed by a quantifier,
and builds a cluster from multiple-octet custom character classes. It also builds
multiple-octet versions of the classic Perl character class shortcuts and the
POSIX-style character classes.

  --------------------------------------------------------------------------------
  Before                  After
  --------------------------------------------------------------------------------
  m/...MULTIOCT+.../      m/...(?:MULTIOCT)+.../
  m/...[AN-EM].../        m/...(?:A[N-Z]|[B-D][A-Z]|E[A-M]).../
  m/...\D.../             m/...(?:${Ebig5::eD}).../
  m/...[[:^digit:]].../   m/...(?:${Ebig5::not_digit}).../
  --------------------------------------------------------------------------------
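
The first two rows can be illustrated with plain Perl. This is a sketch under
assumptions: expand_range() is a hypothetical helper, and it lets the second
octet run over the full range 0x00-0xFF, which may differ from what Big5
software actually generates.

```perl
use strict;
use warnings;

my $text = "\xA4\x40" x 3;    # the same double-octet character three times

# Without clustering, "+" applies to the second octet only.
my ($naive)     = $text =~ /(\xA4\x40+)/;
# With clustering, "+" applies to the whole character.
my ($clustered) = $text =~ /((?:\xA4\x40)+)/;
print length($naive), " ", length($clustered), "\n";    # 2 6

# Hypothetical helper: build an alternation for the double-octet range
# $from-$to, in the spirit of the [AN-EM] row. Assumes $from le $to.
sub expand_range {
    my ($from, $to) = @_;
    my ($lead1, $trail1) = unpack 'C2', $from;
    my ($lead2, $trail2) = unpack 'C2', $to;
    my $hex = sub { sprintf '\\x%02X', $_[0] };

    return sprintf '(?:%s[%s-%s])', $hex->($lead1), $hex->($trail1), $hex->($trail2)
        if $lead1 == $lead2;

    my @alt = (sprintf '%s[%s-\\xFF]', $hex->($lead1), $hex->($trail1));
    push @alt, sprintf '[%s-%s][\\x00-\\xFF]', $hex->($lead1 + 1), $hex->($lead2 - 1)
        if $lead2 - $lead1 > 1;
    push @alt, sprintf '%s[\\x00-%s]', $hex->($lead2), $hex->($trail2);
    return '(?:' . join('|', @alt) . ')';
}

my $range = expand_range("\xA4\x40", "\xA6\x7E");
print "$range\n";    # (?:\xA4[\x40-\xFF]|[\xA5-\xA5][\x00-\xFF]|\xA6[\x00-\x7E])
print(("\xA5\x5C" =~ /\A$range\z/) ? "in\n" : "out\n");    # in
print(("\xA6\x7F" =~ /\A$range\z/) ? "in\n" : "out\n");    # out
```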

=head1 Calling 'Ebig5::ignorecase()' (Big5 software provides)

Big5 software calls 'Ebig5::ignorecase()' instead of using the /i modifier.

  --------------------------------------------------------------------------------
  Before                  After
  --------------------------------------------------------------------------------
  m/...$var.../i          m/...@{[Ebig5::ignorecase($var)]}.../
  --------------------------------------------------------------------------------
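
The reason for avoiding /i can be sketched as follows. ignorecase_sketch() is
a hypothetical stand-in that shows the idea only; the real Ebig5::ignorecase()
may work differently, and this version handles literal text without regexp
metacharacters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the idea behind Ebig5::ignorecase().
# Assumes the pattern is literal text with no regexp metacharacters.
sub ignorecase_sketch {
    my ($pattern) = @_;
    my $out = '';
    while ($pattern =~ /\G([\x81-\xFE][\x00-\xFF]|[\x00-\xFF])/g) {
        my $char = $1;
        # Only single-octet ASCII letters get both cases; double-octet
        # characters pass through untouched.
        $out .= $char =~ /\A[A-Za-z]\z/ ? '[' . uc($char) . lc($char) . ']' : $char;
    }
    return $out;
}

my $pattern = "a\xA4\x41";    # 'a' followed by a double-octet character
my $other   = "A\xA4\x61";    # 'A' followed by a different double-octet character

# Plain /i also folds the second octet (0x41 against 0x61): a false positive.
print(($other =~ /\A$pattern\z/i) ? "match\n" : "no match\n");      # match

my $folded = ignorecase_sketch($pattern);                           # "[Aa]\xA4\x41"
print(($other      =~ /\A$folded\z/) ? "match\n" : "no match\n");   # no match
print(("A\xA4\x41" =~ /\A$folded\z/) ? "match\n" : "no match\n");   # match
```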

=head1 Character-Oriented Regular Expression

A regular expression without the /b modifier works as character-oriented.

  --------------------------------------------------------------------------------
  Before                  After
  --------------------------------------------------------------------------------
  /regexp/                /ditto$Ebig5::matched/
  m/regexp/               m/ditto$Ebig5::matched/
  ?regexp?                m?ditto$Ebig5::matched?
  m?regexp?               m?ditto$Ebig5::matched?
 
  $_ =~                   ($_ =~ m/ditto$Ebig5::matched/) ?
  s/regexp/replacement/   CORE::eval{ Ebig5::s_matched(); local $^W=0; my $__r=qq/replacement/; $_="${1}$__r$'"; 1 } :
                          undef
 
  $_ !~                   ($_ !~ m/ditto$Ebig5::matched/) ?
  s/regexp/replacement/   1 :
                          CORE::eval{ Ebig5::s_matched(); local $^W=0; my $__r=qq/replacement/; $_="${1}$__r$'"; undef }
 
  split(/regexp/)         Ebig5::split(qr/regexp/)
  split(m/regexp/)        Ebig5::split(qr/regexp/)
  split(qr/regexp/)       Ebig5::split(qr/regexp/)
  qr/regexp/              qr/ditto$Ebig5::matched/
  --------------------------------------------------------------------------------
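
The s/// rows rebuild $_ from the skipped prefix, the replacement, and the
rest of the string. A rough sketch of that idea follows; it assumes a
capturing variant of the anchor so that $1 holds the skipped characters, and
it uses substr() with $+[0] where the table uses $'. The helper subst_first()
is hypothetical.

```perl
use strict;
use warnings;

# Assumption: a capturing variant of the anchor, so that $1 holds the
# characters skipped before the match (the ${1} of the table).
my $prefix = qr{\G((?>[^\x81-\xFE]|[\x81-\xFE][\x00-\xFF])*?)};

# Hypothetical helper: replace the first character-aligned match of $re.
# In this sketch $re must not contain capture groups of its own.
sub subst_first {
    my ($text, $re, $replacement) = @_;
    return $text unless $text =~ /${prefix}(?:$re)/;
    return $1 . $replacement . substr($text, $+[0]);    # prefix, new text, rest
}

my $text = "\xA4\x41A";    # double-octet character (second octet 0x41), then 'A'

(my $naive = $text) =~ s/A/B/;
print unpack('H*', $naive), "\n";                          # a44241: character damaged
print unpack('H*', subst_first($text, 'A', 'B')), "\n";    # a44142: only the real 'A' replaced
```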

=head1 Byte-Oriented Regular Expression

A regular expression with the /b modifier works as byte-oriented.

  --------------------------------------------------------------------------------
  Before                  After
  --------------------------------------------------------------------------------
  /regexp/b               /(?:regexp)$Ebig5::matched/
  m/regexp/b              m/(?:regexp)$Ebig5::matched/
  ?regexp?b               m?regexp$Ebig5::matched?
  m?regexp?b              m?regexp$Ebig5::matched?
 
  $_ =~                   ($_ =~ m/(\G[\x00-\xFF]*?)(?:regexp)$Ebig5::matched/) ?
  s/regexp/replacement/b  CORE::eval{ Ebig5::s_matched(); local $^W=0; my $__r=qq/replacement/; $_="${1}$__r$'"; 1 } :
                          undef
 
  $_ !~                   ($_ !~ m/(\G[\x00-\xFF]*?)(?:regexp)$Ebig5::matched/) ?
  s/regexp/replacement/b  1 :
                          CORE::eval{ Ebig5::s_matched(); local $^W=0; my $__r=qq/replacement/; $_="${1}$__r$'"; undef }
 
  split(/regexp/b)        split(qr/regexp/)
  split(m/regexp/b)       split(qr/regexp/)
  split(qr/regexp/b)      split(qr/regexp/)
  qr/regexp/b             qr/(?:regexp)$Ebig5::matched/
  --------------------------------------------------------------------------------
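
For contrast, the byte-oriented prefix (\G[\x00-\xFF]*?) from the s///b rows
skips single octets, so a match may begin inside a double-octet character. A
small sketch, with the hypothetical helper byte_prefix_length():

```perl
use strict;
use warnings;

# Hypothetical helper: number of octets skipped before the first
# byte-oriented match of $re, or -1 when there is no match.
sub byte_prefix_length {
    my ($text, $re) = @_;
    return $text =~ /(\G[\x00-\xFF]*?)(?:$re)/ ? length($1) : -1;
}

my $text = "\xA4\x41";    # one double-octet character; second octet is 0x41

print byte_prefix_length($text, 'A'), "\n";    # 1: the match starts at the second octet
print byte_prefix_length($text, 'B'), "\n";    # -1: octet 0x42 does not occur
```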

=head1 Escaping Character Classes (Ebig5.pm provides)

The character classes are redefined as follows for backward compatibility.

  ---------------------------------------------------------------
  Before        After
  ---------------------------------------------------------------
   .            ${Ebig5::dot}
                ${Ebig5::dot_s}    (/s modifier)
  \d            [0-9]              (universally)
  \s            \s
  \w            [0-9A-Z_a-z]       (universally)
  \D            ${Ebig5::eD}
  \S            ${Ebig5::eS}
  \W            ${Ebig5::eW}
  \h            [\x09\x20]
  \v            [\x0A\x0B\x0C\x0D]
  \H            ${Ebig5::eH}
  \V            ${Ebig5::eV}
  \C            [\x00-\xFF]
  \X            X                  (so, just 'X')
  \R            ${Ebig5::eR}
  \N            ${Ebig5::eN}
  ---------------------------------------------------------------
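
As an illustration of the \D row, the sketch below counts non-digit
characters. The two patterns are copied from the ${Ebig5::dot_s} and
${Ebig5::eD} entries of the "Definitions in Ebig5.pm" table in this document;
the helpers split_chars() and count_non_digits() are hypothetical.

```perl
use strict;
use warnings;

# Copied from the ${Ebig5::dot_s} and ${Ebig5::eD} definitions.
my $dot_s = qr{(?>[^\x81-\xFE]|[\x81-\xFE][\x00-\xFF])};
my $eD    = qr{(?>[^\x81-\xFE0-9]|[\x81-\xFE][\x00-\xFF])};

# Hypothetical helpers, not part of Ebig5.pm.
sub split_chars {
    my ($text) = @_;
    return $text =~ /\G($dot_s)/g;    # list of characters, 1 or 2 octets each
}

sub count_non_digits {
    my ($text) = @_;
    return scalar grep { /\A$eD\z/ } split_chars($text);
}

my $text = "1\xA4\x40a2";    # '1', one double-octet character, 'a', '2'

my $byte_count = () = $text =~ /\D/g;    # perl's own \D sees single octets
print "$byte_count\n";                   # 3
print count_non_digits($text), "\n";     # 2
```

The double-octet character counts once with the redefined class, while the
built-in \D counts each of its two octets.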

POSIX-style character classes are redefined as well.

  ---------------------------------------------------------------
  Before        After
  ---------------------------------------------------------------
  [:alnum:]     [\x30-\x39\x41-\x5A\x61-\x7A]
  [:alpha:]     [\x41-\x5A\x61-\x7A]
  [:ascii:]     [\x00-\x7F]
  [:blank:]     [\x09\x20]
  [:cntrl:]     [\x00-\x1F\x7F]
  [:digit:]     [\x30-\x39]
  [:graph:]     [\x21-\x7F]
  [:lower:]     [\x61-\x7A]
                [\x41-\x5A\x61-\x7A]     (/i modifier)
  [:print:]     [\x20-\x7F]
  [:punct:]     [\x21-\x2F\x3A-\x3F\x40\x5B-\x5F\x60\x7B-\x7E]
  [:space:]     [\s\x0B]
  [:upper:]     [\x41-\x5A]
                [\x41-\x5A\x61-\x7A]     (/i modifier)
  [:word:]      [\x30-\x39\x41-\x5A\x5F\x61-\x7A]
  [:xdigit:]    [\x30-\x39\x41-\x46\x61-\x66]
  [:^alnum:]    ${Ebig5::not_alnum}
  [:^alpha:]    ${Ebig5::not_alpha}
  [:^ascii:]    ${Ebig5::not_ascii}
  [:^blank:]    ${Ebig5::not_blank}
  [:^cntrl:]    ${Ebig5::not_cntrl}
  [:^digit:]    ${Ebig5::not_digit}
  [:^graph:]    ${Ebig5::not_graph}
  [:^lower:]    ${Ebig5::not_lower}
                ${Ebig5::not_lower_i}    (/i modifier)
  [:^print:]    ${Ebig5::not_print}
  [:^punct:]    ${Ebig5::not_punct}
  [:^space:]    ${Ebig5::not_space}
  [:^upper:]    ${Ebig5::not_upper}
                ${Ebig5::not_upper_i}    (/i modifier)
  [:^word:]     ${Ebig5::not_word}
  [:^xdigit:]   ${Ebig5::not_xdigit}
  ---------------------------------------------------------------

\b and \B are redefined as follows for backward compatibility.

  ---------------------------------------------------------------
  Before      After
  ---------------------------------------------------------------
  \b          ${Ebig5::eb}
  \B          ${Ebig5::eB}
  ---------------------------------------------------------------

These are the definitions in Ebig5.pm.

  ---------------------------------------------------------------------------------------------------------------------------------------------------------
  After                    Definition
  ---------------------------------------------------------------------------------------------------------------------------------------------------------
  ${Ebig5::anchor}         qr{\G(?>[^\x81-\xFE]|[\x81-\xFE][\x00-\xFF])*?}
                           for strings longer than 32766 octets on ActivePerl 5.6 and Perl 5.10 or later
                           qr{\G(?(?=.{0,32766}\z)\G(?>[^\x81-\xFE]|[\x81-\xFE][\x00-\xFF])*?|(?(?=[$sbcs]+\z).*?|(?:.*?[$sbcs](?:[^$sbcs][^$sbcs])*?)))}oxms
  ${Ebig5::dot}            qr{(?>[^\x81-\xFE\x0A]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::dot_s}          qr{(?>[^\x81-\xFE]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::eD}             qr{(?>[^\x81-\xFE0-9]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::eS}             qr{(?>[^\x81-\xFE\s]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::eW}             qr{(?>[^\x81-\xFE0-9A-Z_a-z]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::eH}             qr{(?>[^\x81-\xFE\x09\x20]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::eV}             qr{(?>[^\x81-\xFE\x0A\x0B\x0C\x0D]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::eR}             qr{(?>\x0D\x0A|[\x0A\x0D])};
  ${Ebig5::eN}             qr{(?>[^\x81-\xFE\x0A]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_alnum}      qr{(?>[^\x81-\xFE\x30-\x39\x41-\x5A\x61-\x7A]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_alpha}      qr{(?>[^\x81-\xFE\x41-\x5A\x61-\x7A]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_ascii}      qr{(?>[^\x81-\xFE\x00-\x7F]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_blank}      qr{(?>[^\x81-\xFE\x09\x20]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_cntrl}      qr{(?>[^\x81-\xFE\x00-\x1F\x7F]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_digit}      qr{(?>[^\x81-\xFE\x30-\x39]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_graph}      qr{(?>[^\x81-\xFE\x21-\x7F]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_lower}      qr{(?>[^\x81-\xFE\x61-\x7A]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_lower_i}    qr{(?>[^\x81-\xFE\x41-\x5A\x61-\x7A]|[\x81-\xFE][\x00-\xFF])}; # Perl 5.16 compatible
# ${Ebig5::not_lower_i}    qr{(?>[^\x81-\xFE]|[\x81-\xFE][\x00-\xFF])};                   # older Perl compatible
  ${Ebig5::not_print}      qr{(?>[^\x81-\xFE\x20-\x7F]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_punct}      qr{(?>[^\x81-\xFE\x21-\x2F\x3A-\x3F\x40\x5B-\x5F\x60\x7B-\x7E]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_space}      qr{(?>[^\x81-\xFE\s\x0B]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_upper}      qr{(?>[^\x81-\xFE\x41-\x5A]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_upper_i}    qr{(?>[^\x81-\xFE\x41-\x5A\x61-\x7A]|[\x81-\xFE][\x00-\xFF])}; # Perl 5.16 compatible
# ${Ebig5::not_upper_i}    qr{(?>[^\x81-\xFE]|[\x81-\xFE][\x00-\xFF])};                   # older Perl compatible
  ${Ebig5::not_word}       qr{(?>[^\x81-\xFE\x30-\x39\x41-\x5A\x5F\x61-\x7A]|[\x81-\xFE][\x00-\xFF])};
  ${Ebig5::not_xdigit}     qr{(?>[^\x81-\xFE\x30-\x39\x41-\x46\x61-\x66]|[\x81-\xFE][\x00-\xFF])};
  
  # This solution is not perfect. I would welcome a better solution from anyone reading this.
  ${Ebig5::eb}             qr{(?:\A(?=[0-9A-Z_a-z])|(?<=[\x00-\x2F\x40\x5B-\x5E\x60\x7B-\xFF])(?=[0-9A-Z_a-z])|(?<=[0-9A-Z_a-z])(?=[\x00-\x2F\x40\x5B-\x5E\x60\x7B-\xFF]|\z))};
  ${Ebig5::eB}             qr{(?:(?<=[0-9A-Z_a-z])(?=[0-9A-Z_a-z])|(?<=[\x00-\x2F\x40\x5B-\x5E\x60\x7B-\xFF])(?=[\x00-\x2F\x40\x5B-\x5E\x60\x7B-\xFF]))};
  ---------------------------------------------------------------------------------------------------------------------------------------------------------
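
The ${Ebig5::eb} definition can be tried on its own. The sketch below copies
the pattern exactly as printed in the table above and lists the offsets where
it matches; boundary_offsets() is a hypothetical helper. Note that the
non-word class as printed does not include the octets 0x3A-0x3F, so the
result for "a:b" differs from perl's built-in \b.

```perl
use strict;
use warnings;

# Copied from the ${Ebig5::eb} definition as printed in this document.
my $eb = qr{(?:\A(?=[0-9A-Z_a-z])|(?<=[\x00-\x2F\x40\x5B-\x5E\x60\x7B-\xFF])(?=[0-9A-Z_a-z])|(?<=[0-9A-Z_a-z])(?=[\x00-\x2F\x40\x5B-\x5E\x60\x7B-\xFF]|\z))};

# Hypothetical helper: offsets of every position where $re matches.
sub boundary_offsets {
    my ($text, $re) = @_;
    my @offsets;
    while ($text =~ /$re/g) {
        push @offsets, $-[0];
    }
    return @offsets;
}

print join(',', boundary_offsets('ab cd', $eb)), "\n";        # 0,2,3,5
print join(',', boundary_offsets('ab cd', qr/\b/)), "\n";     # 0,2,3,5

# ':' is 0x3A, which is in neither class of the printed definition.
print join(',', boundary_offsets('a:b', $eb)), "\n";          # 0,3
print join(',', boundary_offsets('a:b', qr/\b/)), "\n";       # 0,1,2,3
```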

=head1 Un-Escaping \ of \b{}, \B{}, \N{}, \p{}, \P{}, and \X (Big5 software provides)

Big5 software removes the '\' at the head of the alphanumeric regexp metasymbols
\b{}, \B{}, \N{}, \p{}, \P{}, and \X. This way, you can avoid the trap of the
abstraction.

See also,
Deprecate literal unescaped "{" in regexes.
http://perl5.git.perl.org/perl.git/commit/2a53d3314d380af5ab5283758219417c6dfa36e9

  ------------------------------------
  Before           After
  ------------------------------------
  \b{...}          b\{...}
  \B{...}          B\{...}
  \N{CHARNAME}     N\{CHARNAME}
  \p{L}            p\{L}
  \p{^L}           p\{^L}
  \p{\^L}          p\{\^L}
  \pL              pL
  \P{L}            P\{L}
  \P{^L}           P\{^L}
  \P{\^L}          P\{\^L}
  \PL              PL
  \X               X
  ------------------------------------
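
The rows of this table can be reproduced with three substitutions. This is
only a sketch: unescape_metasymbols() is a hypothetical helper that works on
a pattern string, and it does not handle an escaped backslash placed directly
before one of these letters.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the rewrites of the table to a pattern string.
sub unescape_metasymbols {
    my ($pattern) = @_;
    $pattern =~ s/\\([bBNpP])\{/$1\\{/g;        # \p{L}  becomes  p\{L}
    $pattern =~ s/\\([pP])([A-Za-z])/$1$2/g;    # \pL    becomes  pL
    $pattern =~ s/\\X/X/g;                      # \X     becomes  X
    return $pattern;
}

for my $before ('\b{...}', '\N{CHARNAME}', '\p{L}', '\P{\^L}', '\pL', '\X') {
    printf "%-16s %s\n", $before, unescape_metasymbols($before);
}
```

A plain \b without braces is left untouched, since it is handled by the
${Ebig5::eb} redefinition instead.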

=head1 Escaping Built-in Functions (Big5 software provides)

Big5 software inserts 'Ebig5::' at the head of the function name. Ebig5.pm provides
the Ebig5::* subroutines to your script.

  -------------------------------------------
  Before      After            Works as
  -------------------------------------------
  length      length           Byte
  substr      substr           Byte
  pos         pos              Byte
  split       Ebig5::split     Character
  tr///       Ebig5::tr        Character
  tr///b      tr///            Byte
  tr///B      tr///            Byte
  y///        Ebig5::tr        Character
  y///b       tr///            Byte
  y///B       tr///            Byte
  chop        Ebig5::chop      Character
  index       Ebig5::index     Character
  rindex      Ebig5::rindex    Character
  lc          Ebig5::lc        Character
  lcfirst     Ebig5::lcfirst   Character
  uc          Ebig5::uc        Character
  ucfirst     Ebig5::ucfirst   Character
  fc          Ebig5::fc        Character
  chr         Ebig5::chr       Character
  glob        Ebig5::glob      Character
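
The difference between the "Byte" and "Character" rows can be sketched in
plain Perl. The helpers char_count() and chop_char() are hypothetical and are
not the Ebig5.pm implementations; they only show what working per character
means for a string that contains a double-octet character.

```perl
use strict;
use warnings;

# One character: a single octet, or a lead octet (0x81-0xFE) plus one more.
my $char = qr{(?>[^\x81-\xFE]|[\x81-\xFE][\x00-\xFF])};

# Hypothetical character-oriented helpers.
sub char_count {
    my ($text) = @_;
    my @chars = $text =~ /\G($char)/g;
    return scalar @chars;
}

sub chop_char {
    my ($text) = @_;
    my @chars = $text =~ /\G($char)/g;
    my $last = pop @chars;
    return (join('', @chars), $last);    # remaining text, removed character
}

my $text = "ab\xA4\x40";    # 'a', 'b', one double-octet character

print length($text), "\n";        # 4 (octets)
print char_count($text), "\n";    # 3 (characters)

my $copy = $text;
chop $copy;                                    # built-in chop removes one octet
print unpack('H*', $copy), "\n";               # 6162a4: half a character is left

my ($rest, $removed) = chop_char($text);
print unpack('H*', $rest), " ", unpack('H*', $removed), "\n";    # 6162 a440
```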


