CCCP-Encode
    print $data;
    # output:
    # если в слове 'хлеб' поменять 4 буквы, то получится ? ПИВО

The C<from_to> function from the C<Encode> module replaces each unknown character with 'B<?>'. This data is then saved to your database,
and you end up writing hacky workaround code to fix the problem.
Every developer whose database is not in UTF-8 knows this problem.
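
The substitution described above can be reproduced with the core C<Encode> module alone. The following is a minimal sketch; the sample string and variable names are illustrative assumptions, and C<koi8-r> is chosen because it has no code point for the em dash (U+2014).

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode from_to);

# UTF-8 bytes of a string containing an em dash (U+2014),
# which cannot be represented in koi8-r.
my $data = encode('UTF-8', "хлеб \x{2014} пиво");

# Without a CHECK argument, from_to substitutes '?' for
# every character the target encoding cannot represent.
from_to($data, 'UTF-8', 'koi8-r');

# Count the substitution characters left in the converted bytes.
my $question_marks = () = $data =~ /\?/g;
print "substituted characters: $question_marks\n";
```

Once the converted bytes are stored, the original character cannot be recovered from the 'B<?>'.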

Another case:

Fetching data from RSS feeds in UTF-8 and saving it to a C<cyrillic> database
(for example, MySQL with a default charset of C<koi8-r> or C<windows-1251>).

B<CCCP::Encode> fixes this problem.

=head2 METHODS

=head3 utf2cyrillic($str,$to)

C<$str> is the string to convert. C<$to> is the target encoding name, the same as C<$to> in C<Encode::from_to($str,'utf-8',$to)>.

=head2 PACKAGE VARIABLES

=head3 $CCCP::Encode::Entities

Ignored if C<$CCCP::Encode::ToText> is true.
Default value is 'xml'.

In 'xml' mode, every character unknown in the target charset is replaced with a valid XML numeric entity (e.g. &#x2014;).

In 'html' mode, every character unknown in the target charset is replaced with an HTML numeric entity (e.g. &#8212;).
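
To illustrate the difference between the two entity styles, here is a small sketch that uses the fallback constants built into the core C<Encode> module (C<FB_XMLCREF> and C<FB_HTMLCREF>). It only shows what each notation looks like; it is not a claim about how C<CCCP::Encode> produces entities internally, and the sample text is an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode);

# An em dash (U+2014) has no representation in koi8-r.
my $text = "a \x{2014} b";

# Hexadecimal numeric character reference (XML style).
my $xml = encode('koi8-r', $text, Encode::FB_XMLCREF | Encode::LEAVE_SRC);

# Decimal numeric character reference (HTML style).
my $html = encode('koi8-r', $text, Encode::FB_HTMLCREF | Encode::LEAVE_SRC);

print "$xml\n";   # a &#x2014; b
print "$html\n";  # a &#8212; b
```

Both references denote the same code point, since 0x2014 equals 8212 in decimal.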

=head3 $CCCP::Encode::ToText

Default is false. 

If C<$CCCP::Encode::ToText> is false, C<utf2cyrillic>
returns the converted string with unknown characters replaced according to your own definition (see C<$CCCP::Encode::CharMap>)
or with HTML entities from C<HTML::Entities>.

If C<$CCCP::Encode::ToText> is true, C<utf2cyrillic>
returns the converted string as plain text, with unknown characters replaced according to your own definition (see C<$CCCP::Encode::CharMap>)
or transliterated with C<Text::Unidecode>.

=head3 $CCCP::Encode::CharMap

Default is an empty hashref.

You can define a custom mapping for any character.
This is very flexible if you need custom replacements (different from those of C<HTML::Entities> or C<Text::Unidecode>).
Example:

    $CCCP::Encode::CharMap = {
    	"\x{2014}" => '-',
    	"\x{2015}" => 'foo'
    };
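
The idea of a character map with a fallback can be sketched in plain Perl. The code below is an illustration only, not the module's implementation: the C<replace_unknown> helper is hypothetical, treating every non-ASCII character as unknown is a simplification, and the lookup order (map first, numeric entity second) is an assumption.

```perl
use strict;
use warnings;

# Custom replacements, in the same shape as $CCCP::Encode::CharMap.
my $char_map = {
    "\x{2014}" => '-',
    "\x{2015}" => 'foo',
};

# Hypothetical helper: look each non-ASCII character up in the map
# and fall back to a hexadecimal numeric entity when it is missing.
sub replace_unknown {
    my ($text) = @_;
    $text =~ s{([^\x00-\x7F])}{
        exists $char_map->{$1}
            ? $char_map->{$1}
            : sprintf('&#x%X;', ord $1)
    }ge;
    return $text;
}

print replace_unknown("a \x{2014} b \x{2015} c \x{2026}"), "\n";
# a - b foo c &#x2026;
```

A hash lookup keeps the per-character cost constant, however many custom replacements are defined.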

=head3 $CCCP::Encode::Regexp

The default value is C<[^\p{Cyrillic}|\p{IsLatin}|\p{InBasic_Latin}]>, which matches (and so replaces) any character that is neither Cyrillic nor Latin.
You can override this expression.

See more on C<http://www.regular-expressions.info/unicode.html>
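
To see which characters the default expression selects, the character class can be tried on its own. This is a small sketch with an assumed sample string; it uses only the pattern quoted above and no part of C<CCCP::Encode>.

```perl
use strict;
use warnings;
use utf8;

# The default character class quoted in the documentation above.
my $unknown = qr/[^\p{Cyrillic}|\p{IsLatin}|\p{InBasic_Latin}]/;

# Cyrillic word, em dash (U+2014), ASCII word, accented Latin letter.
my $text = "хлеб \x{2014} beer \x{E9}";

# Collect every character that the class treats as unknown.
my @hits = $text =~ /($unknown)/g;

printf "U+%04X\n", ord($_) for @hits;  # U+2014
```

Only the em dash is reported: spaces and ASCII letters fall in the Basic Latin block, and the accented letter belongs to the Latin script.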

=head1 OVERHEAD

    CCCP::Encode with $CCCP::Encode::Entities eq "html":  
        2 wallclock secs ( 1.63 usr +  0.01 sys =  1.64 CPU) @ 60975.61/s (n=100000)
    
    CCCP::Encode with $CCCP::Encode::Entities eq "xml":  
        3 wallclock secs ( 2.49 usr +  0.00 sys =  2.49 CPU) @ 40160.64/s (n=100000)
    
    CCCP::Encode with $CCCP::Encode::ToText eq "1":  
        4 wallclock secs ( 3.85 usr +  0.02 sys =  3.87 CPU) @ 25839.79/s (n=100000)
            
    Encode::from_to(...) :  
        2 wallclock secs ( 1.93 usr +  0.01 sys =  1.94 CPU) @ 51546.39/s (n=100000)

=head1 SEE ALSO

=over 4

=item *

C<Encode>

=item *

C<Text::Unidecode>

=back

=head1 AUTHOR

Ivan Sivirinov

=cut