Text-ASCII-Convert
lib/Text/ASCII/Convert.pm
can be a string of Unicode characters or a string of UTF-8 octets. The output is always a string of ASCII characters
in the range 0x00 to 0x7F.

This is most useful for catching spam that uses non-ASCII characters to obfuscate words. For example,

    Ýou hãve a nèw vòice-mãil
    You havé Reꞓeìved an Enꞓryptéd Company Maíl

would be converted to

    You have a new voice-mail
    You have ReCeived an EnCrypted Company Mail

Unlike other transliteration software, this plugin converts non-ASCII characters
to their ASCII equivalents based on appearance instead of meaning. For example, the
German eszett character 'ß' is converted to the Roman letter 'B' instead of 'ss'
because it resembles a 'B' in appearance. Likewise, the Greek letter Sigma ('Σ') is
converted to 'E' and a lower case Omega ('ω') is converted to 'w' even though these
letters have different lexical meanings.
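
As a rough illustration of appearance-based mapping, the sketch below uses a
hypothetical three-entry lookup table covering only the characters named above.
The name C<%lookalike> and the helper C<map_by_appearance> are assumptions made
for this example; the module itself builds a much larger table from its data section.

```perl
use strict;
use warnings;

# Hypothetical three-entry map covering only the examples in the text.
my %lookalike = (
    "\x{00DF}" => 'B',   # LATIN SMALL LETTER SHARP S (eszett)
    "\x{03A3}" => 'E',   # GREEK CAPITAL LETTER SIGMA
    "\x{03C9}" => 'w',   # GREEK SMALL LETTER OMEGA
);

# Replace each mapped non-ASCII character with its visual ASCII stand-in;
# characters missing from this small map are left untouched in this sketch.
sub map_by_appearance {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/exists $lookalike{$1} ? $lookalike{$1} : $1/ge;
    return $text;
}

# Sigma, 'A', eszett, 'Y', space, omega, 'in'
print map_by_appearance("\x{03A3}A\x{00DF}Y \x{03C9}in"), "\n";   # prints "EABY win"
```

A meaning-based transliterator would instead turn the eszett into 'ss' and the
Sigma into 'S', which is why the two approaches give different results on the same input.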

Not all non-ASCII characters are converted. For example, the Japanese Hiragana
character 'あ' is not converted because it does not resemble any ASCII character.
Characters that have no ASCII equivalent are replaced by spaces. To avoid long runs
of spaces, multiple spaces are collapsed into a single space. For example,

    Find 💋💘Singles💋💘 in your Area

would be converted to

    Find Singles in your Area
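
The two steps described above (blank out characters with no equivalent, then
collapse the resulting run of spaces) can be sketched as follows. This is a
minimal illustration that assumes an empty map, so every non-ASCII character is
treated as having no ASCII equivalent; C<blank_and_collapse> is a name chosen
for this example only.

```perl
use strict;
use warnings;

# Assumption: an empty map, so no non-ASCII character has an equivalent here.
my %char_map = ();

sub blank_and_collapse {
    my ($text) = @_;
    # Replace each non-ASCII character with its mapping, or a space if none exists.
    $text =~ s/([^[:ascii:]])/defined $char_map{$1} ? $char_map{$1} : ' '/ge;
    # Collapse runs of U+0020 into a single space.
    $text =~ s/\x{20}+/ /g;
    return $text;
}

# U+1F48B (kiss mark) and U+1F498 (heart with arrow) surround the word "Singles".
my $in = "Find \x{1F48B}\x{1F498}Singles\x{1F48B}\x{1F498} in your Area";
print blank_and_collapse($in), "\n";   # prints "Find Singles in your Area"
```

Only the space character is collapsed, so tabs and newlines in the input pass through unchanged.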

The plugin also removes zero-width characters such as the zero-width
space (U+200B) and zero-width non-joiner (U+200C) that are often used to
obfuscate words.

Control characters such as tabs, newlines, and carriage returns are retained.
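
A small sketch of this behaviour, limited to the two zero-width characters
named above. The helper name C<strip_invisible> is an assumption for
illustration and handles only this subset of invisible characters.

```perl
use strict;
use warnings;

sub strip_invisible {
    my ($text) = @_;
    # Assumption: only U+200B (zero-width space) and U+200C (zero-width
    # non-joiner) are removed in this sketch.
    $text =~ s/[\x{200B}\x{200C}]//g;
    return $text;
}

# The tab and newline are ordinary ASCII control characters and are kept.
my $obfuscated = "pass\x{200B}word\tre\x{200C}set\n";
print strip_invisible($obfuscated);   # prints "password", a tab, "reset", a newline
```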

=head1 AUTHORS

Kent Oyer <kent@mxguardian.net>

=head1 LICENSE AND COPYRIGHT

Copyright (C) 2023 MXGuardian LLC

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the LICENSE
file included with this distribution for more information.

You should have received a copy of the GNU General Public License
along with this program.  If not, see https://www.gnu.org/licenses/.

=cut

UNITCHECK {
    # build character map from __DATA__ section
    while (<DATA>) {
        chomp;
        my ($key,$value) = split /\s+/;
        my $ascii = join('', map { chr(hex($_)) } split /\+/, $value);
        $char_map{chr(hex($key))} = $ascii;
    }
    close DATA;
};
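
The loop above implies that each data line holds a hexadecimal code point,
whitespace, and one or more hexadecimal ASCII codes joined by C<+>. The sketch
below parses two sample lines in that shape. The sample lines and the helper
name C<parse_map_line> are hypothetical; the module's actual data is not shown here.

```perl
use strict;
use warnings;

# Parse one map line of the form "<hex code point> <hex>[+<hex>...]".
sub parse_map_line {
    my ($line) = @_;
    chomp $line;
    my ($key, $value) = split /\s+/, $line;
    # Each '+'-separated hex value becomes one ASCII character.
    my $ascii = join '', map { chr(hex($_)) } split /\+/, $value;
    return (chr(hex($key)), $ascii);
}

# Hypothetical sample lines, not taken from the module's data section.
for my $line ("00DF 42\n", "00E6 61+65\n") {
    my ($char, $ascii) = parse_map_line($line);
    printf "U+%04X => %s\n", ord($char), $ascii;
}
# prints:
# U+00DF => B
# U+00E6 => ae
```

Joining values with C<+> lets a single source character expand to several ASCII characters.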

# Converts a string of Unicode characters (or UTF-8 encoded bytes) to a string of ASCII characters
# in the range 0x00 to 0x7F. Non-ASCII characters are replaced with their ASCII equivalents.
# Zero-width characters and combining marks are removed. Multiple spaces are collapsed into a single space.
#
sub convert_to_ascii {
    my $str = is_valid_utf_8($_[0]) ? decode('UTF-8', $_[0]) : $_[0];
    # remove zero-width characters and combining marks
    $str =~ s/[\xAD\x{034F}\x{200A}-\x{200F}\x{202A}\x{202B}\x{202C}\x{2060}\x{FEFF}]|\p{Combining_Mark}//g;
    # replace non-ascii characters with ascii equivalents
    $str =~ s/([^[:ascii:]])/defined($char_map{$1})?$char_map{$1}:' '/eg;
    # collapse spaces
    $str =~ s/\x{20}+/ /g;
    return $str;
}

# Returns true if the provided string of octets is syntactically valid
# UTF-8; otherwise returns false.
# Copied from Mail::SpamAssassin::Util::is_valid_utf8
#
sub is_valid_utf_8 {
    return undef if !defined $_[0];
    #
    # RFC 6532: UTF8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4
    # RFC 3629 section 4: Syntax of UTF-8 Byte Sequences
    #   UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
    #   UTF8-1      = %x00-7F
    #   UTF8-2      = %xC2-DF UTF8-tail
    #   UTF8-3      = %xE0 %xA0-BF UTF8-tail /
    #                 %xE1-EC 2( UTF8-tail ) /
    #                 %xED %x80-9F UTF8-tail /
    #                   # U+D800..U+DFFF are utf16 surrogates, not legal utf8
    #                 %xEE-EF 2( UTF8-tail )
    #   UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) /
    #                 %xF1-F3 3( UTF8-tail ) /
    #                 %xF4 %x80-8F 2( UTF8-tail )
    #   UTF8-tail   = %x80-BF
    #
    # loose variant:
    #   [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] |
    #   [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF4][\x80-\xBF]{3}
    #
    $_[0] =~ /^ (?: [\x00-\x7F] |
                  [\xC2-\xDF] [\x80-\xBF] |
                  \xE0 [\xA0-\xBF] [\x80-\xBF] |
                  [\xE1-\xEC] [\x80-\xBF]{2} |
                  \xED [\x80-\x9F] [\x80-\xBF] |
                  [\xEE-\xEF] [\x80-\xBF]{2} |
                  \xF0 [\x90-\xBF] [\x80-\xBF]{2} |
                  [\xF1-\xF3] [\x80-\xBF]{3} |
                  \xF4 [\x80-\x8F] [\x80-\xBF]{2} )* \z/xs ? 1 : 0;
}
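
To see which byte sequences the RFC 3629 grammar accepts, the sketch below runs
a few sample octet strings through a validity check. Note that it swaps in a
different technique: instead of the regular expression above, it relies on the
strict C<UTF-8> decoder from the core Encode module with C<FB_CROAK>. The
helper name C<decodes_as_strict_utf8> is an assumption for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Alternative check (not the module's regex): let Encode's strict
# UTF-8 decoder accept or reject the octets.
sub decodes_as_strict_utf8 {
    my ($octets) = @_;
    return 0 if !defined $octets;
    my $copy = $octets;   # decode may modify its argument when a CHECK value is given
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my @samples = (
    [ "caf\xC3\xA9"  => 'two-byte sequence for U+00E9' ],
    [ "\xC0\x80"     => 'overlong encoding of U+0000'  ],
    [ "\xED\xA0\x80" => 'UTF-16 surrogate U+D800'      ],
    [ "caf\xE9"      => 'lone Latin-1 byte'            ],
);

# Print a label and the verdict for each sample.
printf "%-30s %s\n", $_->[1], decodes_as_strict_utf8($_->[0]) ? 'valid' : 'invalid'
    for @samples;
```

The overlong and surrogate samples correspond to the C<%xC2-DF> lower bound on
UTF8-2 and the C<%xED %x80-9F> restriction on UTF8-3 in the grammar quoted above.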