BOM results from the CPAN

Text-AutoCSV


use Carp;
use Params::Validate qw(validate validate_pos :types);
use List::MoreUtils qw(first_index indexes);
use Fcntl qw(SEEK_SET);
use File::BOM;
use Text::CSV;
use DateTime;

# DateTime::Format::Strptime 1.70 does not work properly with us.
# Actually all versions from 1.63 on are fine, except 1.70.

lib/Text/AutoCSV.pm view on Meta::CPAN

            if ( open my $fh, '<:raw', $in_file ) {
                my $bom;
                read $fh, $bom, 3;
                if ( length($bom) == 3 and $bom eq "\xef\xbb\xbf" ) {
                    if ( !defined($via) ) {
                        $m .= ":via(File::BOM)";
                    }
                }
                close $fh;
            }
        }
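
The excerpt above reads the first three bytes of the input and compares them with the UTF-8 BOM before choosing I/O layers. A minimal standalone sketch of that check, using core Perl only; the helper name `has_utf8_bom` is hypothetical and is not part of Text::AutoCSV:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::AutoCSV): return 1 if the file
# at $path starts with the three-byte UTF-8 BOM (EF BB BF), else 0.
sub has_utf8_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $head = '';
    my $n = read $fh, $head, 3;    # fewer than 3 bytes on very short files
    close $fh;
    die "Cannot read $path: $!" unless defined $n;
    return $n == 3 && $head eq "\xEF\xBB\xBF" ? 1 : 0;
}
```

Opening with `:raw` matters here: with an encoding layer already in place, the BOM would arrive as the single character U+FEFF rather than as three bytes.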

lib/Text/AutoCSV.pm view on Meta::CPAN


        # out_encoding option takes precedence
        $enc = $self->{out_encoding} if defined( $self->{out_encoding} );
        my $m = ":encoding($enc)";
        if ( _is_utf8($enc) and $self->{out_utf8_bom} ) {
            $m .= ':via(File::BOM)';
        }

        if ( $OS_IS_PLAIN_WINDOWS and $FIX_PERLMONKS_823214 ) {

  # Tested with UTF-16LE, NOT tested with UTF-16BE (it should be the same story)

lib/Text/AutoCSV.pm view on Meta::CPAN

    my $csv = Text::AutoCSV->new(in_file => 'in.csv', out_file => 'out.csv',
        out_encoding => 'UTF-16');

=item out_utf8_bom

Enforce a BOM (Byte-Order-Mark) on output when the output encoding is UTF-8.
If the output encoding is not UTF-8, this attribute is ignored.

B<NOTE>

UTF-8 needs no BOM (there is no Byte-Order in UTF-8), and in practice,
UTF8-encoded files rarely have a BOM.

Using this attribute is not recommended. It is provided for the sake of
completeness, and also to produce Unicode files Microsoft EXCEL will be happy to
read.

lib/Text/AutoCSV.pm view on Meta::CPAN

like this:

    out_encoding => 'UTF-16'

But... While EXCEL will identify UTF-16 and read it as such, it will not take
into account the BOM found at the beginning. In the end the first cell will have
2 useless characters prepended. The only solution the author knows to work around
this issue is to use UTF-8 as output encoding, and enforce a BOM. That is, use:

    ..., out_encoding => 'UTF-8', out_utf8_bom => 1, ...
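
At the byte level, the workaround above amounts to emitting U+FEFF as the very first character of a UTF-8 stream. A rough sketch with core Perl only, not using Text::AutoCSV; `write_utf8_with_bom` is a made-up name, and the naive `join` does no CSV quoting:

```perl
use strict;
use warnings;

# Hypothetical helper: write rows as naive comma-separated lines, UTF-8
# encoded, preceded by a BOM. No quoting or escaping is done, so this only
# suits fields that contain no commas, quotes or newlines.
sub write_utf8_with_bom {
    my ($path, @rows) = @_;
    open my $out, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    print {$out} "\x{FEFF}";                      # the layer encodes this as EF BB BF
    print {$out} join(',', @$_), "\n" for @rows;
    close $out or die "Cannot close $path: $!";
    return;
}
```

For real data, a CSV writer that handles quoting is the safer choice; the sketch only shows where the BOM goes.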

=item out_sep_char

Text-BibTeX

1 match

lib/Text/BibTeX/File.pm view on Meta::CPAN

=item BINMODE

By default, Text::BibTeX uses bytes directly. Thus, you need to encode
strings accordingly with the encoding of the files you are reading. You can
also select UTF-8. In this case, Text::BibTeX will return UTF-8 strings in
NFC mode. Note that at the moment files with BOM are not supported.

Valid values are 'raw/bytes' or 'utf-8'.

=item NORMALIZATION

Text-CGILike

1 match

t/000-report-versions.t view on Meta::CPAN

        return $self->_error("Did not provide a string to load");
    }

    # Byte order marks
    # NOTE: Keeping this here to educate maintainers
    # my %BOM = (
    #     "\357\273\277" => 'UTF-8',
    #     "\376\377"     => 'UTF-16BE',
    #     "\377\376"     => 'UTF-16LE',
    #     "\377\376\0\0" => 'UTF-32LE',
    #     "\0\0\376\377" => 'UTF-32BE',
    # );
    if ( $string =~ /^(?:\376\377|\377\376|\377\376\0\0|\0\0\376\377)/ ) {
        return $self->_error("Stream has a non UTF-8 BOM");
    }
    else {
        # Strip UTF-8 bom if found, we'll just ignore it
        $string =~ s/^\357\273\277//;
    }
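
The check above only needs to reject any non-UTF-8 BOM, so the order of its alternation does not matter. A classifier that has to name the encoding must try the longer marks first, because the UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE). A minimal sketch of such a classifier; `bom_encoding` and `@BOMS` are hypothetical names, not taken from the excerpted code:

```perl
use strict;
use warnings;

# BOM byte sequences, longest first, so that UTF-32LE is tested before
# UTF-16LE (both start with FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF" => 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00" => 'UTF-32LE' ],
    [ "\xEF\xBB\xBF"     => 'UTF-8'    ],
    [ "\xFE\xFF"         => 'UTF-16BE' ],
    [ "\xFF\xFE"         => 'UTF-16LE' ],
);

# Hypothetical helper: return (encoding name, BOM length in bytes) for a
# byte string, or an empty list when it starts with no known BOM.
sub bom_encoding {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($mark, $name) = @$entry;
        return ($name, length $mark)
            if substr($bytes, 0, length $mark) eq $mark;
    }
    return;
}
```

One ambiguity remains by design: UTF-16LE text whose first character after the BOM is U+0000 is byte-for-byte identical to a UTF-32LE BOM, and this sketch reports it as UTF-32LE.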

Text-CSV

6 results

lib/Text/CSV.pm view on Meta::CPAN


=head2 Unicode

Unicode is only tested to work with perl-5.8.2 and up.

See also L</BOM>.

The simplest way to ensure the correct encoding is used for  in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.

lib/Text/CSV.pm view on Meta::CPAN


 $csv = Text::CSV::Encoded->new ({ encoding  => undef }); # default
 # combine () and print () accept UTF8 marked data
 # parse () and getline () return UTF8 marked data

=head2 BOM

BOM  (or Byte Order Mark)  handling is available only inside the L</header>
method.   This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.

If a file has a BOM, the easiest way to deal with that is

 my $aoh = csv (in => $file, detect_bom => 1);

All records will be encoded based on the detected BOM.

This implies a call to the  L</header>  method,  which defaults to also set
the L</column_names>. So this is B<not> the same as

 my $aoh = csv (in => $file, headers => "auto");

which only reads the first record to set  L</column_names>  but ignores any
BOM that may be present.

=head1 METHODS

This section is also taken from Text::CSV_XS.

lib/Text/CSV.pm view on Meta::CPAN


=item detect_bom

 $csv->header ($fh, { detect_bom => 1 });

The default behavior is to detect if the header line starts with a BOM.  If
the header has a BOM, use that to set the encoding of C<$fh>.  This default
behavior can be disabled by passing a false value to C<detect_bom>.

Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE,  and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1,  and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.

If a supported BOM was detected as start of the stream, it is stored in the
object attribute C<ENCODING>.

 my $enc = $csv->{ENCODING};

The encoding is used with C<binmode> on C<$fh>.

If the handle was opened in a (correct) encoding,  this method will  B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.
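
The "decoded BOM" case described above can be reproduced with core Perl alone: when a handle already carries an `:encoding(UTF-8)` layer, the bytes EF BB BF reach the program as the single character U+FEFF. The sketch below shows that situation and one way a caller might strip the character by hand; `first_line_without_bom` is a hypothetical helper and does not show how the `header` method is implemented:

```perl
use strict;
use warnings;

# Hypothetical helper: read the first line from a handle that already has
# an :encoding(...) layer and drop a leading decoded BOM (U+FEFF), if any.
sub first_line_without_bom {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    $line =~ s/\A\x{FEFF}//;
    return $line;
}

# Usage: read from an in-memory byte string that starts with a UTF-8 BOM.
# Without the substitution, U+FEFF would stay glued to the first column name.
my $bytes = "\xEF\xBB\xBFid,name\n1,Ann\n";
open my $fh, '<:encoding(UTF-8)', \$bytes or die "Cannot open: $!";
my $header = first_line_without_bom($fh);
close $fh;
```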

=item munge_column_names

This option offers the means to modify the column names into something that

lib/Text/CSV.pm view on Meta::CPAN

to C<open>. There is no default value. This attribute does not work in perl
5.6.x.  C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.

If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly.   This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.

Encodings can be stacked, as supported by C<binmode>:

lib/Text/CSV.pm view on Meta::CPAN

 $aoa = csv (in => "test.csv:gzip.gz", encoding => ":gzip");

=head3 detect_bom

If  C<detect_bom>  is given, the method  L</header>  will be invoked on the
opened stream to check if there is a BOM and set the encoding accordingly.

C<detect_bom> can be abbreviated to C<bom>.

This is the same as setting L<C<encoding>|/encoding> to C<"auto">.

Text-CSV_XS

8 results

CSV_XS.pm view on Meta::CPAN


=head2 Unicode

Unicode is only tested to work with perl-5.8.2 and up.

See also L</BOM>.

The simplest way to ensure the correct encoding is used for  in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.

CSV_XS.pm view on Meta::CPAN


 $csv = Text::CSV::Encoded->new ({ encoding  => undef }); # default
 # combine () and print () accept UTF8 marked data
 # parse () and getline () return UTF8 marked data

=head2 BOM

BOM  (or Byte Order Mark)  handling is available only inside the L</header>
method.   This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.

If a file has a BOM, the easiest way to deal with that is

 my $aoh = csv (in => $file, detect_bom => 1);

All records will be encoded based on the detected BOM.

This implies a call to the  L</header>  method,  which defaults to also set
the L</column_names>. So this is B<not> the same as

 my $aoh = csv (in => $file, headers => "auto");

which only reads the first record to set  L</column_names>  but ignores any
BOM that may be present.

=head1 SPECIFICATION

While no formal specification for CSV exists, L<RFC 4180|https://datatracker.ietf.org/doc/html/rfc4180>
(I<1>) describes the common format and establishes  C<text/csv> as the MIME

CSV_XS.pm view on Meta::CPAN

=item detect_bom
X<detect_bom>

 $csv->header ($fh, { detect_bom => 1 });

The default behavior is to detect if the header line starts with a BOM.  If
the header has a BOM, use that to set the encoding of C<$fh>.  This default
behavior can be disabled by passing a false value to C<detect_bom>.

Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE,  and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1,  and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.

If a supported BOM was detected as start of the stream, it is stored in the
object attribute C<ENCODING>.

 my $enc = $csv->{ENCODING};

The encoding is used with C<binmode> on C<$fh>.

If the handle was opened in a (correct) encoding,  this method will  B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.

=item munge_column_names
X<munge_column_names>

CSV_XS.pm view on Meta::CPAN

to C<open>. There is no default value. This attribute does not work in perl
5.6.x.  C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.

If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly.   This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.

Encodings can be stacked, as supported by C<binmode>:

CSV_XS.pm view on Meta::CPAN


=head3 detect_bom
X<detect_bom>

If  C<detect_bom>  is given, the method  L</header>  will be invoked on the
opened stream to check if there is a  BOM and set the encoding accordingly.
Note that the attribute L<C<headers>|/headers>  can be used to overrule the
default behavior of how that method automatically sets the attribute.

C<detect_bom> can be abbreviated to C<bom>.

CSV_XS.pm view on Meta::CPAN


 use Text::CSV_XS qw( csv );
 csv (in => csv (in => "bad.csv", sep_char => ";"), out => *STDOUT);

As C<STDOUT> is now default in L</csv>, a one-liner converting a UTF-16 CSV
file with BOM and TAB-separation to valid UTF-8 CSV could be:

 $ perl -C3 -MText::CSV_XS=csv -we\
    'csv(in=>"utf16tab.csv",encoding=>"utf16",sep=>"\t")' >utf8.csv
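
The one-liner above names the encoding as plain `utf16`, without a byte-order suffix. Assuming that value ends up in an `:encoding(...)` layer backed by Perl's Encode module, the generic name means "read the BOM to pick the byte order". A minimal core-only sketch of the same conversion on an in-memory byte string; `utf16_to_utf8` is a hypothetical helper, not part of Text::CSV_XS:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert BOM-prefixed UTF-16 bytes to UTF-8 bytes.
# Encode's generic "UTF-16" decoder reads the BOM to choose the byte order
# and leaves it out of the decoded text; it dies when no BOM is present.
sub utf16_to_utf8 {
    my ($bytes) = @_;
    my $text = decode('UTF-16', $bytes, Encode::FB_CROAK);
    return encode('UTF-8', $text);
}
```

When the byte order is known in advance and the data carries no BOM, the explicit names `UTF-16LE` or `UTF-16BE` are the ones to use instead.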

=head3 Unifying EOL

CSV_XS.pm view on Meta::CPAN


A script to rewrite (in)valid CSV into valid CSV files.  Script has options
to generate confusing CSV files or CSV files that conform to Dutch MS-Excel
exports (using C<;> as separation).

Script - by default - honors BOM  and auto-detects separation converting it
to default standard CSV with C<,> as separator.

=back

=head1 CAVEATS

Text-Conversation

1 match

t/000-report-versions.t view on Meta::CPAN

        return $self->_error("Did not provide a string to load");
    }

    # Byte order marks
    # NOTE: Keeping this here to educate maintainers
    # my %BOM = (
    #     "\357\273\277" => 'UTF-8',
    #     "\376\377"     => 'UTF-16BE',
    #     "\377\376"     => 'UTF-16LE',
    #     "\377\376\0\0" => 'UTF-32LE',
    #     "\0\0\376\377" => 'UTF-32BE',
    # );
    if ( $string =~ /^(?:\376\377|\377\376|\377\376\0\0|\0\0\376\377)/ ) {
        return $self->_error("Stream has a non UTF-8 BOM");
    } else {
        # Strip UTF-8 bom if found, we'll just ignore it
        $string =~ s/^\357\273\277//;
    }

Text-Filter

1 match

lib/Text/Filter/Cooked.pm view on Meta::CPAN

Lines that start with a custom defined comment symbol are ignored.

=back

=for later
On top of this, if the input file starts with a Unicode BOM, the input
will be correctly decoded into Perl internal format. It is also
possible to change the encoding used in a single file as often as
desired. See L<INPUT ENCODING>.

Text::Filter::Cooked is based on Text::Filter, see L<Text::Filter>.

lib/Text/Filter/Cooked.pm view on Meta::CPAN


=begin later

	my $ienc = $self->get_input_encoding;
	if ( $ienc && ! defined $self->_get_lineno ) {
	    # Detecting BOM...
	    if ( substr($line, 0, 2) eq "\xff\xfe" ) {
		# Found BOM (LE): FF FE is the UTF-16 little-endian mark
		$line = substr($line, 2);
		$self->set_input_encoding($ienc = "utf-16le");
	    }
	    elsif ( substr($line, 0, 2) eq "\xfe\xff" ) {
		# Found BOM (BE): FE FF is the UTF-16 big-endian mark
		$line = substr($line, 2);
		$self->set_input_encoding($ienc = "utf-16be");
	    }
	}

lib/Text/Filter/Cooked.pm view on Meta::CPAN

have arbitrary character encodings.

If the C<input_encoding> attribute is set, the input data is assumed
to be in the specified encoding.

If the input file starts with a Unicode BOM marker, it will be
considered UTF-16 and decoded accordingly.

If the file contains a comment record with non-comment contents of the
form

Text-JavE

1 match

t.jmov view on Meta::CPAN

J:A71 30 %0%0%0%0%0%0%0%0%0%39 nger th%0%14 MM%21 crowing roos%0%14 oo%0%13 <%0%14 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:66
J:A71 30 %0%0%0%0%0%0%0%0%0%41 er%0%14 MM%25 ing r%0%14 oo%0%13 <%0%14 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%0%14 MM%0%14 oo%0%13 <%0%14 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:1000
J:A71 30 %0%0%0%0%0%0%0%0%4 \ /%0%4 /(\%21 The%0%6 )%7 MM%0%5 (%8 oo%0%3 ,-'-.%5 <%0  (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:66
J:A71 30 %0%0%0%0%0%0%0%0%4 ` '%0%4 ,(.%21 The ner%0%6 )%7 MM%0%5 (%8 oo%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 \ /%0%4 /(\%21 The nerves%0%6 )%7 MM%0%5 (%8 Oo%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 ` '%0%4 ,(.%21 The nerves%0%6 )%7 MM%12 of%0%5 (%8 Oo%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 ` /%0%4 /(.%21 The nerves%0%6 )%7 MM%12 of a%0%5 (%8 OO%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 \ '%0%4 ,(\%21 The nerves%0%6 )%7 MM%12 of a cluc%0%5 (%8 OO%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 \ '%0%4 ,(\%21 The nerves%0%6 )%7 MM%12 of a clucking%0%5 (%8 Oo%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 ` /%0%4 /(.%21 The nerves%0%6 )%7 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%5 ` /%20 The nerves%0%5 /).%6 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%5 \ '%20 The nerves%0%5 ,)\%6 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%5 \ /%20 The nerves%0%5 /)\%6 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%5 ` '%20 The nerves%0%5 ,).%6 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` '%7 MM%12 of a clucking%0%4 ,(.%7 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 \ /%7 MM%12 of a clucking%0%4 /(\%7 oO%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 \ '%7 MM%12 of a clucking%0%4 ,(\%7 OO%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 \ /%7 MM%12 of a clucking%0%4 /(\%7 oO%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` '%7 MM%12 of a clucking%0%4 ,(.%7 oO%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` /%7 MM%12 of a clucking%0%4 ,(.%7 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` '%7 MM%12 of a clucking%0%4 ,(\%7 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` '%7 MM%12 of a clucking%0%4 /(.%7 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 \ '%7 MM%12 of a clucking%0%4 ,(.%7 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%4 ,(.%7 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB%7 -/%0%5 --%6 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%14 oo%12 hen%0%3 ,-'-.%5 <%0  (BOMB%7 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%14 oo%12 hen%0%4 -%8 <%0  (BOM%8 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%14 oo%12 hen%0%13 <%0%14 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%4 #phew#%4 oO%12 hen%0%10 \  <%0%14 =/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:1000
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%14 oO%12 hen%0%13 <%0%14 =/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:200

Text-LTSV

2 results

cover_db/digests view on Meta::CPAN

{"e607c6dfd4cc596889d41737d3c52f65":"/Users/naoya/perl5/perlbrew/perls/perl-5.17.6/lib/5.17.6/warnings/register.pm","fee12004fe12f764cfeb5aa8850b5d81":"/Users/naoya/perl5/perlbrew/perls/perl-5.17.6/lib/site_perl/5.17.6/PPI/Token/DashedWord.pm","01f1d...

Text-Markdown-Discount

1 match

lib/Text/Markdown/ppport.h view on Meta::CPAN

BOL_t8_p8|5.033003||Viu
BOL_t8_pb|5.033003||Viu
BOL_tb|5.035004||Viu
BOL_tb_p8|5.033003||Viu
BOL_tb_pb|5.033003||Viu
BOM_UTF8|5.025005|5.003007|p
BOM_UTF8_FIRST_BYTE|5.019004||Viu
BOM_UTF8_TAIL|5.019004||Viu
boolSV|5.004000|5.003007|p
boot_core_builtin|5.035007||Viu
boot_core_mro|5.009005||Viu
boot_core_PerlIO|5.007002||Viu
boot_core_UNIVERSAL|5.003007||Viu

lib/Text/Markdown/ppport.h view on Meta::CPAN

#endif

#endif

#if 'A' == 65
#ifndef BOM_UTF8
#  define BOM_UTF8                       "\xEF\xBB\xBF"
#endif

#ifndef REPLACEMENT_CHARACTER_UTF8
#  define REPLACEMENT_CHARACTER_UTF8     "\xEF\xBF\xBD"
#endif

#elif '^' == 95
#ifndef BOM_UTF8
#  define BOM_UTF8                       "\xDD\x73\x66\x73"
#endif

#ifndef REPLACEMENT_CHARACTER_UTF8
#  define REPLACEMENT_CHARACTER_UTF8     "\xDD\x73\x73\x71"
#endif

#elif '^' == 176
#ifndef BOM_UTF8
#  define BOM_UTF8                       "\xDD\x72\x65\x72"
#endif

#ifndef REPLACEMENT_CHARACTER_UTF8
#  define REPLACEMENT_CHARACTER_UTF8     "\xDD\x72\x72\x70"
#endif

Text-Markdown-Hoedown

1 match

hoedown/src/document.c view on Meta::CPAN

}

void
hoedown_document_render(hoedown_document *doc, hoedown_buffer *ob, const uint8_t *data, size_t size)
{
	static const uint8_t UTF8_BOM[] = {0xEF, 0xBB, 0xBF};

	hoedown_buffer *text;
	size_t beg, end;

	int footnotes_enabled;

hoedown/src/document.c view on Meta::CPAN

	}

	/* first pass: looking for references, copying everything else */
	beg = 0;

	/* Skip a possible UTF-8 BOM, even though the Unicode standard
	 * discourages having these in UTF-8 documents */
	if (size >= 3 && memcmp(data, UTF8_BOM, 3) == 0)
		beg += 3;

	while (beg < size) /* iterating over lines */
		if (footnotes_enabled && is_footnote(data, beg, size, &end, &doc->footnotes_found))
			beg = end;

Text-Markup

17 results

lib/Text/Markup.pm view on Meta::CPAN


=item C<encoding>

The character encoding to assume the source file is encoded in (if such cannot
be determined by other means, such as a
L<BOM|https://en.wikipedia.org/wiki/Byte_order_mark>). If not specified, the
value of the C<default_encoding> attribute will be used, and if that attribute
is not set, UTF-8 will be assumed.

=item C<options>

lib/Text/Markup.pm view on Meta::CPAN

  package Text::Markup::FooBar;

  use 5.8.1;
  use strict;
  use Text::FooBar ();
  use File::BOM qw(open_bom);

  sub import {
      # Replace the regex if passed one.
      Text::Markup->register( foobar => $_[1] ) if $_[1];
  }

lib/Text/Markup.pm view on Meta::CPAN

      );
  }

Use the C<$encoding> argument as appropriate to read in the source file. If
your parser requires that text be decoded to Perl's internal form, use of
L<File::BOM> is recommended, so that an explicit BOM will determine the
encoding. Otherwise, fall back on the specified encoding. Note that some
parsers, such as an HTML parser, want the text to still be encoded when they
parse it. In such a case, read in the file as raw bytes:

      open my $fh, '<:raw', $file or die "Cannot open $file: $!\n";

Text-Minify-XS

4 results

lib/Text/Minify/XS.pm view on Meta::CPAN


=head1 VERSION

version v0.7.7

=for stopwords BOM minify minifier

=head1 SYNOPSIS

  use Text::Minify::XS qw/ minify /;

lib/Text/Minify/XS.pm view on Meta::CPAN

memory overflows. You should ensure that the input string is properly
encoded as UTF-8.

=head2 Byte Order Marks

The Byte Order Mark (BOM) at the beginning of a file will not be removed. That is because the minifier does not know
whether its input is the beginning of a file.
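
Since the minifier leaves a BOM alone, a caller that knows it holds the start of a file can remove the mark before minifying. A small sketch of such a pre-processing step; `strip_leading_bom` is a hypothetical helper and not part of Text::Minify::XS:

```perl
use strict;
use warnings;

# Hypothetical helper: remove one leading BOM, whether the string holds
# decoded characters (U+FEFF) or still-encoded UTF-8 bytes (EF BB BF).
sub strip_leading_bom {
    my ($str) = @_;
    $str =~ s/\A(?:\x{FEFF}|\xEF\xBB\xBF)//;
    return $str;
}
```

Accepting both forms is a convenience with a cost: decoded text that really begins with the three characters U+00EF U+00BB U+00BF would lose them, so callers that know which form they hold may prefer to match only that one.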

=head1 SECURITY CONSIDERATIONS

Passing malformed UTF-8 characters may throw an exception, which in some cases could lead to a denial of service if

Text-Pipe-Encoding

1 match

t/000-report-versions.t view on Meta::CPAN

        return $self->_error("Did not provide a string to load");
    }

    # Byte order marks
    # NOTE: Keeping this here to educate maintainers
    # my %BOM = (
    #     "\357\273\277" => 'UTF-8',
    #     "\376\377"     => 'UTF-16BE',
    #     "\377\376"     => 'UTF-16LE',
    #     "\377\376\0\0" => 'UTF-32LE',
    #     "\0\0\376\377" => 'UTF-32BE',
    # );
    if ( $string =~ /^(?:\376\377|\377\376|\377\376\0\0|\0\0\376\377)/ ) {
        return $self->_error("Stream has a non UTF-8 BOM");
    } else {
        # Strip UTF-8 bom if found, we'll just ignore it
        $string =~ s/^\357\273\277//;
    }

Text-Pipe-HTML

1 match

t/000-report-versions.t view on Meta::CPAN

		return $self->_error("Did not provide a string to load");
	}

	# Byte order marks
	# NOTE: Keeping this here to educate maintainers
	# my %BOM = (
	#     "\357\273\277" => 'UTF-8',
	#     "\376\377"     => 'UTF-16BE',
	#     "\377\376"     => 'UTF-16LE',
	#     "\377\376\0\0" => 'UTF-32LE',
	#     "\0\0\376\377" => 'UTF-32BE',
	# );
	if ( $string =~ /^(?:\376\377|\377\376|\377\376\0\0|\0\0\376\377)/ ) {
		return $self->_error("Stream has a non UTF-8 BOM");
	} else {
		# Strip UTF-8 bom if found, we'll just ignore it
		$string =~ s/^\357\273\277//;
	}

Text-Pipe-Translate

1 match

t/000-report-versions.t view on Meta::CPAN

		return $self->_error("Did not provide a string to load");
	}

	# Byte order marks
	# NOTE: Keeping this here to educate maintainers
	# my %BOM = (
	#     "\357\273\277" => 'UTF-8',
	#     "\376\377"     => 'UTF-16BE',
	#     "\377\376"     => 'UTF-16LE',
	#     "\377\376\0\0" => 'UTF-32LE',
	#     "\0\0\376\377" => 'UTF-32BE',
	# );
	if ( $string =~ /^(?:\376\377|\377\376|\377\376\0\0|\0\0\376\377)/ ) {
		return $self->_error("Stream has a non UTF-8 BOM");
	} else {
		# Strip UTF-8 bom if found, we'll just ignore it
		$string =~ s/^\357\273\277//;
	}

Text-Pipe-W3CDTF

1 match

t/000-report-versions.t view on Meta::CPAN

		return $self->_error("Did not provide a string to load");
	}

	# Byte order marks
	# NOTE: Keeping this here to educate maintainers
	# my %BOM = (
	#     "\357\273\277" => 'UTF-8',
	#     "\376\377"     => 'UTF-16BE',
	#     "\377\376"     => 'UTF-16LE',
	#     "\377\376\0\0" => 'UTF-32LE',
	#     "\0\0\376\377" => 'UTF-32BE',
	# );
	if ( $string =~ /^(?:\376\377|\377\376|\377\376\0\0|\0\0\376\377)/ ) {
		return $self->_error("Stream has a non UTF-8 BOM");
	} else {
		# Strip UTF-8 bom if found, we'll just ignore it
		$string =~ s/^\357\273\277//;
	}

Text-Restructured

4 results

doc/src/latest/reStructuredText.html view on Meta::CPAN

</li>
</ol>
<p>For example, none of the following are recognized as containing inline
markup start-strings:</p>
<ul class="simple">
<li>asterisks: * "*" '*' (*) (* [*] {*} 1*x BOM32_*</li>
<li>double asterisks: **  a**b O(N**2) etc.</li>
<li>backquotes: ` `` etc.</li>
<li>underscores: _ __ __init__ __init__() etc.</li>
<li>vertical bars: | || etc.</li>
</ul>

Text-SRT-Align

3 results

META.yml view on Meta::CPAN

    - test
requires:
  Clone: 0
  Encode: 0
  Encode::Locale: 0
  File::BOM: 0
  File::ShareDir: 0
  FindBin: 0
  Getopt::Std: 0
  IO::File: 0
  IPC::Open3: 0

Text-Summarizer

1 match

Corpus/written/newspaper:newswire/NYTnewswire9.txt view on Meta::CPAN


 NYT20020731.0191 
 2002-07-31 18:45 

A4537 &Cx1f; tad-z
u a BC-ATOMIC-BOMB-0801-COX       07-31 1808


 BC-ATOMIC-BOMB-0801-COX 
  
 A mission to end a war  
 By DENISE GAMINO  
   Cox News Service

Text-TinySegmenter

1 match

lib/Text/TinySegmenter.pm view on Meta::CPAN

my %TC2 = ("HHO" => 2088,"HII" => -1023,"HMM" => -1154,"IHI" => -1965,"KKH" => 703,"OII" => -2649);
my %TC3 = ("AAA" => -294,"HHH" => 346,"HHI" => -341,"HII" => -1088,"HIK" => 731,"HOH" => -1486,"IHH" => 128,"IHI" => -3041,"IHO" => -1935,"IIH" => -825,"IIM" => -1035,"IOI" => -542,"KHH" => -1216,"KKA" => 491,"KKH" => -1217,"KOK" => -1009,"MHH" => -2...
my %TC4 = ("HHH" => -203,"HHI" => 1344,"HHK" => 365,"HHM" => -122,"HHN" => 182,"HHO" => 669,"HIH" => 804,"HII" => 679,"HOH" => 446,"IHH" => 695,"IHO" => -2324,"IIH" => 321,"III" => 1497,"IIO" => 656,"IOO" => 54,"KAK" => 4845,"KKA" => 3386,"KKK" => 30...
my %TQ1 = ("BHHH" => -227,"BHHI" => 316,"BHIH" => -132,"BIHH" => 60,"BIII" => 1595,"BNHH" => -744,"BOHH" => 225,"BOOO" => -908,"OAKK" => 482,"OHHH" => 281,"OHIH" => 249,"OIHI" => 200,"OIIH" => -68);
my %TQ2 = ("BIHH" => -1401,"BIII" => -1033,"BKAK" => -543,"BOOO" => -5591);
my %TQ3 = ("BHHH" => 478,"BHHM" => -1073,"BHIH" => 222,"BHII" => -504,"BIIH" => -116,"BIII" => -105,"BMHI" => -863,"BMHM" => -464,"BOMH" => 620,"OHHH" => 346,"OHHI" => 1729,"OHII" => 997,"OHMH" => 481,"OIHH" => 623,"OIIH" => 1344,"OKAK" => 2792,"OKHH...
my %TQ4 = ("BHHH" => -721,"BHHM" => -3604,"BHII" => -966,"BIIH" => -607,"BIII" => -2181,"OAAA" => -2763,"OAKK" => 180,"OHHH" => -294,"OHHI" => 2446,"OHHO" => 480,"OHIH" => -1573,"OIHH" => 1935,"OIHI" => -493,"OIIH" => 626,"OIII" => -4007,"OKAK" => -8...
my %TW1 = ("につい" => -4681,"東京都" => 2026);
my %TW2 = ("ある程" => -2049,"いった" => -1256,"ころが" => -2434,"しょう" => 3873,"その後" => -4430,"だって" => -1049,"ていた" => 1833,"として" => -4657,"ともに" => -4517,"もので" => 1882,"一気に" => -792,"初めて" ...
my %TW3 = ("いただ" => -1734,"してい" => 1314,"として" => -4314,"につい" => -5483,"にとっ" => -5989,"に当た" => -6247,"ので," => -727,"ので、" => -727,"のもの" => -600,"れから" => -3752,"十二月" => -2287);
my %TW4 = ("いう." => 8576,"いう。" => 8576,"からな" => -2348,"してい" => 2958,"たが," => 1516,"たが、" => 1516,"ている" => 1538,"という" => 1349,"ました" => 5543,"ません" => 1097,"ようと" => -4258,"よると" => 5865);

Text-UnAbbrev

1 match

share/en_US/technology/information view on Meta::CPAN

BMP        Basic Multilingual Plane
BNC        Bayonet Neill-Concelman
BOFH       bastard operator from hell
BOHICA     bend over here it comes again
BOINC      Berkeley Open Infrastructure for Network Computing
BOM        Byte Order Mark
BOOTP      Bootstrap Protocol
BPDU       Bridge Protocol Data Unit
BPEL       Business Process Execution Language
BPL        Broadband over Power Lines
BPS        bits per second

Text-VimColor

1 match

t/encoding.t view on Meta::CPAN

env_compare [qw(utf8 c)] => 'ascii is fine',
  sub { string("( hi )\n") }, tvc_html('hi');

# FIXME: These don't work on a lot of smokers (particularly FreeBSD),
# (even ones with vim 7.2 +multi_byte).
# We could do a simple check to see if BOM recognition works as we expect it
# and only then perform this test, because really we only want to test that
# we haven't broken this functionality in environments where it already worked.
# Aside from lots of failing reports, see also rt-92601.

TODO: {

  local $TODO = 'Do simpler pre-tests to determine if these tests should pass in this environment.';

env_compare utf8 => 'use BOM to get vim to honor encoded text',
  sub { prepend_bom($filetype, $input) }, $html;

env_compare utf8 => 'specify encoding by adding "+set fenc=..." to vim_options',
  sub { pass_vim_options(undef, $input, {filetype => $filetype}) }, $html;

t/encoding.t view on Meta::CPAN

  my ($lang, $str) = @_;
  # MORITZ/App-Mowyw-v0.7.1/lib/App/Mowyw.pm#L566
  {
    # any encoding will do if vim automatically detects it
    my $vim_encoding = 'utf-8';
    my $BOM = "\x{feff}";
    my $syn = Text::VimColor->new(
            filetype    => $lang,
            string      => encode($vim_encoding, $BOM . $str),
            );
    $str = decode($vim_encoding, $syn->html);
    $str =~ s/^$BOM//;
    return $str;
  };
}

sub pass_vim_options {

Text-VisualWidth

1 match

VisualWidth.xs view on Meta::CPAN


int count_single_char_utf8( const unsigned char** pos, int* byte ){
  *byte = 0;
  if( **pos == 0 ) return 0;
  if( **pos == 0xef && *((*pos)+1) == 0xbb && *((*pos)+2) == 0xbf ){
    // BOM
    (*pos)+= 3;
    (*byte)+= 3;
//    printf("BOM\n");
    return 0;
  } else if( ( **pos & 0xe0 ) == 0xc0 && ( ( *((*pos)+1) & 0xc0 ) == 0x80 ) ){
    (*pos)+= 2;
    (*byte)+= 2;
//    printf("2byte\n");
