view release on metacpan or search on metacpan
lib/Text/AutoCSV.pm view on Meta::CPAN
use Carp;
use Params::Validate qw(validate validate_pos :types);
use List::MoreUtils qw(first_index indexes);
use Fcntl qw(SEEK_SET);
use File::BOM;
use Text::CSV;
use DateTime;
# DateTime::Format::Strptime 1.70 does not work properly with us.
# Actually all versions as of 1.63 are fine, except 1.70.
lib/Text/AutoCSV.pm view on Meta::CPAN
if ( open my $fh, '<:raw', $in_file ) {
my $bom;
read $fh, $bom, 3;
if ( length($bom) == 3 and $bom eq "\xef\xbb\xbf" ) {
if ( !defined($via) ) {
$m .= ":via(File::BOM)";
}
}
close $fh;
}
}
lib/Text/AutoCSV.pm view on Meta::CPAN
# out_encoding option takes precedence
$enc = $self->{out_encoding} if defined( $self->{out_encoding} );
my $m = ":encoding($enc)";
if ( _is_utf8($enc) and $self->{out_utf8_bom} ) {
$m .= ':via(File::BOM)';
}
if ( $OS_IS_PLAIN_WINDOWS and $FIX_PERLMONKS_823214 ) {
# Tested with UTF-16LE, NOT tested with UTF-16BE (it should be the same story)
lib/Text/AutoCSV.pm view on Meta::CPAN
my $csv = Text::AutoCSV->new(in_file => 'in.csv', out_file => 'out.csv',
out_encoding => 'UTF-16');
=item out_utf8_bom
Enforce BOM (Byte-Order-Mark) on output, when it is UTF8. If output encoding is
not UTF-8, this attribute is ignored.
B<NOTE>
UTF-8 needs no BOM (there is no Byte-Order in UTF-8), and in practice,
UTF8-encoded files rarely have a BOM.
Using this attribute is not recommended. It is provided for the sake of
completeness, and also to produce Unicode files Microsoft EXCEL will be happy to
read.
lib/Text/AutoCSV.pm view on Meta::CPAN
like this:
out_encoding => 'UTF-16'
But... While EXCEL will identify UTF-16 and read it as such, it will not take
into account the BOM found at the beginning. In the end the first cell will have
2 useless characters prepended. The only solution the author knows to work around
this issue is to use UTF-8 as output encoding, and enforce a BOM. That is, use:
..., out_encoding => 'UTF-8', out_utf8_bom => 1, ...
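As a rough illustration of what a UTF-8 file with an enforced BOM looks like on disk, here is a minimal core-Perl sketch. It is not how Text::AutoCSV does it (the module stacks a C<:via(File::BOM)> layer, as shown in the source excerpt above); the helper name C<write_utf8_with_bom> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::AutoCSV): write lines as UTF-8,
# preceded by a byte order mark.
sub write_utf8_with_bom {
    my ($path, @lines) = @_;
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    print {$fh} "\x{FEFF}";          # U+FEFF is written as the bytes EF BB BF
    print {$fh} "$_\n" for @lines;
    close $fh or die "Cannot close $path: $!";
    return;
}
```

Reading the file back with a C<:raw> layer should show the three BOM bytes ahead of the first record.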
=item out_sep_char
view all matches for this distribution
view release on metacpan or search on metacpan
lib/Text/BibTeX/File.pm view on Meta::CPAN
=item BINMODE
By default, Text::BibTeX uses bytes directly. Thus, you need to encode
strings accordingly with the encoding of the files you are reading. You can
also select UTF-8. In this case, Text::BibTeX will return UTF-8 strings in
NFC mode. Note that at the moment files with BOM are not supported.
Valid values are 'raw/bytes' or 'utf-8'.
=item NORMALIZATION
view all matches for this distribution
view release on metacpan or search on metacpan
t/000-report-versions.t view on Meta::CPAN
return $self->_error("Did not provide a string to load");
}
# Byte order marks
# NOTE: Keeping this here to educate maintainers
# my %BOM = (
# "\357\273\277" => 'UTF-8',
# "\376\377" => 'UTF-16BE',
# "\377\376" => 'UTF-16LE',
# "\377\376\0\0" => 'UTF-32LE',
# "\0\0\376\377" => 'UTF-32BE',
# );
if ( $string =~ /^(?:\376\377|\377\376|\377\376\0\0|\0\0\376\377)/ ) {
return $self->_error("Stream has a non UTF-8 BOM");
}
else {
# Strip UTF-8 bom if found, we'll just ignore it
$string =~ s/^\357\273\277//;
}
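The commented-out table above can be turned into a small lookup. The following is only a sketch under stated assumptions: C<detect_bom> and C<@BOMS> are hypothetical names, and the entries are ordered longest first so that the UTF-32LE mark is tried before the UTF-16LE mark it begins with.

```perl
use strict;
use warnings;

# BOM byte sequences, longest first. The UTF-32LE mark starts with the
# UTF-16LE mark, so it has to be tested before it.
my @BOMS = (
    [ "\x00\x00\xFE\xFF" => 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00" => 'UTF-32LE' ],
    [ "\xEF\xBB\xBF"     => 'UTF-8'    ],
    [ "\xFE\xFF"         => 'UTF-16BE' ],
    [ "\xFF\xFE"         => 'UTF-16LE' ],
);

# Hypothetical helper: return (encoding name, BOM length in bytes),
# or an empty list when the byte string does not start with a known BOM.
sub detect_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return ($name, length $bom) if index($bytes, $bom) == 0;
    }
    return;
}
```

Note that a UTF-16LE stream whose first character after the BOM is U+0000 would be reported as UTF-32LE by this ordering; byte inspection alone cannot tell the two apart.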
view all matches for this distribution
view release on metacpan or search on metacpan
lib/Text/CSV.pm view on Meta::CPAN
=head2 Unicode
Unicode is only tested to work with perl-5.8.2 and up.
See also L</BOM>.
The simplest way to ensure the correct encoding is used for in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.
lib/Text/CSV.pm view on Meta::CPAN
$csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
# combine () and print () accept UTF8 marked data
# parse () and getline () return UTF8 marked data
=head2 BOM
BOM (or Byte Order Mark) handling is available only inside the L</header>
method. This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.
If a file has a BOM, the easiest way to deal with that is
my $aoh = csv (in => $file, detect_bom => 1);
All records will be encoded based on the detected BOM.
This implies a call to the L</header> method, which defaults to also set
the L</column_names>. So this is B<not> the same as
my $aoh = csv (in => $file, headers => "auto");
which only reads the first record to set L</column_names> but ignores any
BOM that may be present.
=head1 METHODS
This section is also taken from Text::CSV_XS.
lib/Text/CSV.pm view on Meta::CPAN
=item detect_bom
$csv->header ($fh, { detect_bom => 1 });
The default behavior is to detect if the header line starts with a BOM. If
the header has a BOM, use that to set the encoding of C<$fh>. This default
behavior can be disabled by passing a false value to C<detect_bom>.
Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1, and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.
If a supported BOM was detected as start of the stream, it is stored in the
object attribute C<ENCODING>.
my $enc = $csv->{ENCODING};
The encoding is used with C<binmode> on C<$fh>.
If the handle was opened in a (correct) encoding, this method will B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.
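To illustrate the last point: when a handle already decodes its input, the BOM no longer appears as bytes but as the single character C<U+FEFF> at the start of the first line. The sketch below is plain core Perl, not Text::CSV code, and the helper name C<first_line_without_bom> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read the first line from a handle that already has
# a decoding layer, and drop a decoded BOM if one is present.
sub first_line_without_bom {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    $line =~ s/\A\x{FEFF}//;    # one character here, not three bytes
    return $line;
}
```

A handle opened with C<< '<:encoding(UTF-8)' >> on a file that starts with the bytes C<EF BB BF> is the typical case where this applies.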
=item munge_column_names
This option offers the means to modify the column names into something that
lib/Text/CSV.pm view on Meta::CPAN
to C<open>. There is no default value. This attribute does not work in perl
5.6.x. C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.
If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly. This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.
Encodings can be stacked, as supported by C<binmode>:
lib/Text/CSV.pm view on Meta::CPAN
$aoa = csv (in => "test.csv:gzip.gz", encoding => ":gzip");
=head3 detect_bom
If C<detect_bom> is given, the method L</header> will be invoked on the
opened stream to check if there is a BOM and set the encoding accordingly.
C<detect_bom> can be abbreviated to C<bom>.
This is the same as setting L<C<encoding>|/encoding> to C<"auto">.
view all matches for this distribution
view release on metacpan or search on metacpan
=head2 Unicode
Unicode is only tested to work with perl-5.8.2 and up.
See also L</BOM>.
The simplest way to ensure the correct encoding is used for in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.
$csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
# combine () and print () accept UTF8 marked data
# parse () and getline () return UTF8 marked data
=head2 BOM
BOM (or Byte Order Mark) handling is available only inside the L</header>
method. This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.
If a file has a BOM, the easiest way to deal with that is
my $aoh = csv (in => $file, detect_bom => 1);
All records will be encoded based on the detected BOM.
This implies a call to the L</header> method, which defaults to also set
the L</column_names>. So this is B<not> the same as
my $aoh = csv (in => $file, headers => "auto");
which only reads the first record to set L</column_names> but ignores any
BOM that may be present.
=head1 SPECIFICATION
While no formal specification for CSV exists, L<RFC 4180|https://datatracker.ietf.org/doc/html/rfc4180>
(I<1>) describes the common format and establishes C<text/csv> as the MIME
=item detect_bom
X<detect_bom>
$csv->header ($fh, { detect_bom => 1 });
The default behavior is to detect if the header line starts with a BOM. If
the header has a BOM, use that to set the encoding of C<$fh>. This default
behavior can be disabled by passing a false value to C<detect_bom>.
Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1, and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.
If a supported BOM was detected as start of the stream, it is stored in the
object attribute C<ENCODING>.
my $enc = $csv->{ENCODING};
The encoding is used with C<binmode> on C<$fh>.
If the handle was opened in a (correct) encoding, this method will B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.
=item munge_column_names
X<munge_column_names>
to C<open>. There is no default value. This attribute does not work in perl
5.6.x. C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.
If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly. This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.
Encodings can be stacked, as supported by C<binmode>:
=head3 detect_bom
X<detect_bom>
If C<detect_bom> is given, the method L</header> will be invoked on the
opened stream to check if there is a BOM and set the encoding accordingly.
Note that the attribute L<C<headers>|/headers> can be used to overrule the
default behavior of how that method automatically sets the attribute.
C<detect_bom> can be abbreviated to C<bom>.
use Text::CSV_XS qw( csv );
csv (in => csv (in => "bad.csv", sep_char => ";"), out => *STDOUT);
As C<STDOUT> is now default in L</csv>, a one-liner converting a UTF-16 CSV
file with BOM and TAB-separation to valid UTF-8 CSV could be:
$ perl -C3 -MText::CSV_XS=csv -we\
'csv(in=>"utf16tab.csv",encoding=>"utf16",sep=>"\t")' >utf8.csv
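For the transcoding half of that one-liner, a minimal sketch using only the core L<Encode> module might look as follows. It does no CSV parsing at all, and the function name C<utf16_to_utf8> is hypothetical. It relies on the C<UTF-16> decoder of Encode, which expects a BOM, uses it to pick the byte order, and removes it from the result.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn UTF-16 bytes (with BOM) into UTF-8 bytes.
sub utf16_to_utf8 {
    my ($bytes) = @_;
    # 'UTF-16' reads the BOM to choose LE or BE and strips it;
    # FB_CROAK makes malformed input die instead of being replaced.
    my $text = decode('UTF-16', $bytes, Encode::FB_CROAK());
    return encode('UTF-8', $text);
}
```

Separator conversion and quoting are still best left to C<csv>, since a plain substitution of TAB by comma would not handle embedded separators.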
=head3 Unifying EOL
A script to rewrite (in)valid CSV into valid CSV files. Script has options
to generate confusing CSV files or CSV files that conform to Dutch MS-Excel
exports (using C<;> as separation).
Script - by default - honors BOM and auto-detects separation converting it
to default standard CSV with C<,> as separator.
=back
=head1 CAVEATS
view all matches for this distribution
view release on metacpan or search on metacpan
lib/Text/Filter/Cooked.pm view on Meta::CPAN
Lines that start with a custom defined comment symbol are ignored.
=back
=for later
On top of this, if the input file starts with a Unicode BOM, the input
will be correctly decoded into Perl internal format. It is also
possible to change the encoding used in a single file as often as
desired. See L<INPUT ENCODING>.
Text::Filter::Cooked is based on Text::Filter, see L<Text::Filter>.
lib/Text/Filter/Cooked.pm view on Meta::CPAN
=begin later
my $ienc = $self->get_input_encoding;
if ( $ienc && ! defined $self->_get_lineno ) {
# Detecting BOM...
if ( substr($line, 0, 2) eq "\xff\xfe" ) {
# Found BOM (LE)
$line = substr($line, 2);
$self->set_input_encoding($ienc = "utf-16le");
}
elsif ( substr($line, 0, 2) eq "\xfe\xff" ) {
# Found BOM (BE)
$line = substr($line, 2);
$self->set_input_encoding($ienc = "utf-16be");
}
}
lib/Text/Filter/Cooked.pm view on Meta::CPAN
have arbitrary character encodings.
If the C<input_encoding> attribute is set, the input data is assumed
to be in the specified encoding.
If the input file starts with a Unicode BOM marker, it will be
considered UTF-16 and decoded accordingly.
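A minimal sketch of that rule, using the core L<Encode> module: a leading C<FF FE> selects UTF-16LE, a leading C<FE FF> selects UTF-16BE, and anything else is decoded with the configured input encoding. The function name C<decode_utf16_if_bom> and its C<$fallback> parameter are hypothetical and not part of Text::Filter::Cooked.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a byte string, letting a UTF-16 BOM
# override the fallback encoding.
sub decode_utf16_if_bom {
    my ($bytes, $fallback) = @_;
    if ( $bytes =~ s/\A\xFF\xFE// ) {      # FF FE: little-endian
        return decode('UTF-16LE', $bytes);
    }
    if ( $bytes =~ s/\A\xFE\xFF// ) {      # FE FF: big-endian
        return decode('UTF-16BE', $bytes);
    }
    return decode($fallback, $bytes);      # no BOM: use the given encoding
}
```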
If the file contains a comment record with non-comment contents of the
form
view all matches for this distribution
view release on metacpan or search on metacpan
J:A71 30 %0%0%0%0%0%0%0%0%0%39 nger th%0%14 MM%21 crowing roos%0%14 oo%0%13 <%0%14 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:66
J:A71 30 %0%0%0%0%0%0%0%0%0%41 er%0%14 MM%25 ing r%0%14 oo%0%13 <%0%14 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%0%14 MM%0%14 oo%0%13 <%0%14 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:1000
J:A71 30 %0%0%0%0%0%0%0%0%4 \ /%0%4 /(\%21 The%0%6 )%7 MM%0%5 (%8 oo%0%3 ,-'-.%5 <%0 (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:66
J:A71 30 %0%0%0%0%0%0%0%0%4 ` '%0%4 ,(.%21 The ner%0%6 )%7 MM%0%5 (%8 oo%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 \ /%0%4 /(\%21 The nerves%0%6 )%7 MM%0%5 (%8 Oo%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 ` '%0%4 ,(.%21 The nerves%0%6 )%7 MM%12 of%0%5 (%8 Oo%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 ` /%0%4 /(.%21 The nerves%0%6 )%7 MM%12 of a%0%5 (%8 OO%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 \ '%0%4 ,(\%21 The nerves%0%6 )%7 MM%12 of a cluc%0%5 (%8 OO%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 \ '%0%4 ,(\%21 The nerves%0%6 )%7 MM%12 of a clucking%0%5 (%8 Oo%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%4 ` /%0%4 /(.%21 The nerves%0%6 )%7 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%5 ` /%20 The nerves%0%5 /).%6 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%5 \ '%20 The nerves%0%5 ,)\%6 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%5 \ /%20 The nerves%0%5 /)\%6 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%5 ` '%20 The nerves%0%5 ,).%6 MM%12 of a clucking%0%5 (%8 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` '%7 MM%12 of a clucking%0%4 ,(.%7 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 \ /%7 MM%12 of a clucking%0%4 /(\%7 oO%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 \ '%7 MM%12 of a clucking%0%4 ,(\%7 OO%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 \ /%7 MM%12 of a clucking%0%4 /(\%7 oO%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` '%7 MM%12 of a clucking%0%4 ,(.%7 oO%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` /%7 MM%12 of a clucking%0%4 ,(.%7 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` '%7 MM%12 of a clucking%0%4 ,(\%7 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 >/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 ` '%7 MM%12 of a clucking%0%4 /(.%7 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%4 \ '%7 MM%12 of a clucking%0%4 ,(.%7 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB )%5 -/%0%3 `%3-'%5 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%4 ,(.%7 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB%7 -/%0%5 --%6 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%14 oo%12 hen%0%3 ,-'-.%5 <%0 (BOMB%7 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%14 oo%12 hen%0%4 -%8 <%0 (BOM%8 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%14 oo%12 hen%0%13 <%0%14 -/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%4 #phew#%4 oO%12 hen%0%10 \ <%0%14 =/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:1000
J:A71 30 %0%0%0%0%0%0%0%0%0%28 The nerves%0%14 MM%12 of a clucking%0%14 oO%12 hen%0%13 <%0%14 =/%0%13 /^\%0%12 : C :%0%12 8%3=8%0%13 |^|%0%12 .%3|%0%12 ~-~-
+:200
view all matches for this distribution
view release on metacpan or search on metacpan
cover_db/digests view on Meta::CPAN
{"e607c6dfd4cc596889d41737d3c52f65":"/Users/naoya/perl5/perlbrew/perls/perl-5.17.6/lib/5.17.6/warnings/register.pm","fee12004fe12f764cfeb5aa8850b5d81":"/Users/naoya/perl5/perlbrew/perls/perl-5.17.6/lib/site_perl/5.17.6/PPI/Token/DashedWord.pm","01f1d...
view all matches for this distribution
view release on metacpan or search on metacpan
lib/Text/Markdown/ppport.h view on Meta::CPAN
BOL_t8_p8|5.033003||Viu
BOL_t8_pb|5.033003||Viu
BOL_tb|5.035004||Viu
BOL_tb_p8|5.033003||Viu
BOL_tb_pb|5.033003||Viu
BOM_UTF8|5.025005|5.003007|p
BOM_UTF8_FIRST_BYTE|5.019004||Viu
BOM_UTF8_TAIL|5.019004||Viu
boolSV|5.004000|5.003007|p
boot_core_builtin|5.035007||Viu
boot_core_mro|5.009005||Viu
boot_core_PerlIO|5.007002||Viu
boot_core_UNIVERSAL|5.003007||Viu
lib/Text/Markdown/ppport.h view on Meta::CPAN
#endif
#endif
#if 'A' == 65
#ifndef BOM_UTF8
# define BOM_UTF8 "\xEF\xBB\xBF"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
# define REPLACEMENT_CHARACTER_UTF8 "\xEF\xBF\xBD"
#endif
#elif '^' == 95
#ifndef BOM_UTF8
# define BOM_UTF8 "\xDD\x73\x66\x73"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
# define REPLACEMENT_CHARACTER_UTF8 "\xDD\x73\x73\x71"
#endif
#elif '^' == 176
#ifndef BOM_UTF8
# define BOM_UTF8 "\xDD\x72\x65\x72"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
# define REPLACEMENT_CHARACTER_UTF8 "\xDD\x72\x72\x70"
#endif
view all matches for this distribution
view release on metacpan or search on metacpan
hoedown/src/document.c view on Meta::CPAN
}
void
hoedown_document_render(hoedown_document *doc, hoedown_buffer *ob, const uint8_t *data, size_t size)
{
static const uint8_t UTF8_BOM[] = {0xEF, 0xBB, 0xBF};
hoedown_buffer *text;
size_t beg, end;
int footnotes_enabled;
hoedown/src/document.c view on Meta::CPAN
}
/* first pass: looking for references, copying everything else */
beg = 0;
/* Skip a possible UTF-8 BOM, even though the Unicode standard
* discourages having these in UTF-8 documents */
if (size >= 3 && memcmp(data, UTF8_BOM, 3) == 0)
beg += 3;
while (beg < size) /* iterating over lines */
if (footnotes_enabled && is_footnote(data, beg, size, &end, &doc->footnotes_found))
beg = end;
view all matches for this distribution
view release on metacpan or search on metacpan
lib/Text/Markup.pm view on Meta::CPAN
=item C<encoding>
The character encoding to assume the source file is encoded in (if such cannot
be determined by other means, such as a
L<BOM|https://en.wikipedia.org/wiki/Byte_order_mark>). If not specified, the
value of the C<default_encoding> attribute will be used, and if that attribute
is not set, UTF-8 will be assumed.
=item C<options>
lib/Text/Markup.pm view on Meta::CPAN
package Text::Markup::FooBar;
use 5.8.1;
use strict;
use Text::FooBar ();
use File::BOM qw(open_bom);
sub import {
# Replace the regex if passed one.
Text::Markup->register( foobar => $_[1] ) if $_[1];
}
lib/Text/Markup.pm view on Meta::CPAN
);
}
Use the C<$encoding> argument as appropriate to read in the source file. If
your parser requires that text be decoded to Perl's internal form, use of
L<File::BOM> is recommended, so that an explicit BOM will determine the
encoding. Otherwise, fall back on the specified encoding. Note that some
parsers, such as an HTML parser, expect the text to still be encoded when they parse it.
In such a case, read in the file as raw bytes:
open my $fh, '<:raw', $file or die "Cannot open $file: $!\n";
view all matches for this distribution
view release on metacpan or search on metacpan
lib/Text/Minify/XS.pm view on Meta::CPAN
=head1 VERSION
version v0.7.7
=for stopwords BOM minify minifier
=head1 SYNOPSIS
use Text::Minify::XS qw/ minify /;
lib/Text/Minify/XS.pm view on Meta::CPAN
memory overflows. You should ensure that the input string is properly
encoded as UTF-8.
=head2 Byte Order Marks
The Byte Order Mark (BOM) at the beginning of a file will not be removed. That is because the minifier does not know
whether this is the beginning of a file or not.
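If the caller does know that a string is the start of a file, the BOM can be dropped before minifying. A minimal sketch, assuming the input is a UTF-8 encoded byte string; the helper name C<strip_utf8_bom> is hypothetical and not part of Text::Minify::XS.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF)
# from a UTF-8 encoded byte string, leaving everything else untouched.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}
```

The result can then be handed to C<minify> as usual.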
=head1 SECURITY CONSIDERATIONS
Passing malformed UTF-8 characters may throw an exception, which in some cases could lead to a denial of service if
view all matches for this distribution
view release on metacpan or search on metacpan
doc/src/latest/reStructuredText.html view on Meta::CPAN
</li>
</ol>
<p>For example, none of the following are recognized as containing inline
markup start-strings:</p>
<ul class="simple">
<li>asterisks: * "*" '*' (*) (* [*] {*} 1*x BOM32_*</li>
<li>double asterisks: ** a**b O(N**2) etc.</li>
<li>backquotes: ` `` etc.</li>
<li>underscores: _ __ __init__ __init__() etc.</li>
<li>vertical bars: | || etc.</li>
</ul>
view all matches for this distribution
view release on metacpan or search on metacpan
- test
requires:
Clone: 0
Encode: 0
Encode::Locale: 0
File::BOM: 0
File::ShareDir: 0
FindBin: 0
Getopt::Std: 0
IO::File: 0
IPC::Open3: 0
view all matches for this distribution
view release on metacpan or search on metacpan
Corpus/written/newspaper:newswire/NYTnewswire9.txt view on Meta::CPAN
NYT20020731.0191
2002-07-31 18:45
A4537 &Cx1f; tad-z
u a BC-ATOMIC-BOMB-0801-COX 07-31 1808
BC-ATOMIC-BOMB-0801-COX
A mission to end a war
By DENISE GAMINO
Cox News Service
view all matches for this distribution
view release on metacpan or search on metacpan
lib/Text/TinySegmenter.pm view on Meta::CPAN
my %TC2 = ("HHO" => 2088,"HII" => -1023,"HMM" => -1154,"IHI" => -1965,"KKH" => 703,"OII" => -2649);
my %TC3 = ("AAA" => -294,"HHH" => 346,"HHI" => -341,"HII" => -1088,"HIK" => 731,"HOH" => -1486,"IHH" => 128,"IHI" => -3041,"IHO" => -1935,"IIH" => -825,"IIM" => -1035,"IOI" => -542,"KHH" => -1216,"KKA" => 491,"KKH" => -1217,"KOK" => -1009,"MHH" => -2...
my %TC4 = ("HHH" => -203,"HHI" => 1344,"HHK" => 365,"HHM" => -122,"HHN" => 182,"HHO" => 669,"HIH" => 804,"HII" => 679,"HOH" => 446,"IHH" => 695,"IHO" => -2324,"IIH" => 321,"III" => 1497,"IIO" => 656,"IOO" => 54,"KAK" => 4845,"KKA" => 3386,"KKK" => 30...
my %TQ1 = ("BHHH" => -227,"BHHI" => 316,"BHIH" => -132,"BIHH" => 60,"BIII" => 1595,"BNHH" => -744,"BOHH" => 225,"BOOO" => -908,"OAKK" => 482,"OHHH" => 281,"OHIH" => 249,"OIHI" => 200,"OIIH" => -68);
my %TQ2 = ("BIHH" => -1401,"BIII" => -1033,"BKAK" => -543,"BOOO" => -5591);
my %TQ3 = ("BHHH" => 478,"BHHM" => -1073,"BHIH" => 222,"BHII" => -504,"BIIH" => -116,"BIII" => -105,"BMHI" => -863,"BMHM" => -464,"BOMH" => 620,"OHHH" => 346,"OHHI" => 1729,"OHII" => 997,"OHMH" => 481,"OIHH" => 623,"OIIH" => 1344,"OKAK" => 2792,"OKHH...
my %TQ4 = ("BHHH" => -721,"BHHM" => -3604,"BHII" => -966,"BIIH" => -607,"BIII" => -2181,"OAAA" => -2763,"OAKK" => 180,"OHHH" => -294,"OHHI" => 2446,"OHHO" => 480,"OHIH" => -1573,"OIHH" => 1935,"OIHI" => -493,"OIIH" => 626,"OIII" => -4007,"OKAK" => -8...
my %TW1 = ("につい" => -4681,"東京都" => 2026);
my %TW2 = ("ある程" => -2049,"いった" => -1256,"ころが" => -2434,"しょう" => 3873,"その後" => -4430,"だって" => -1049,"ていた" => 1833,"として" => -4657,"ともに" => -4517,"もので" => 1882,"一気に" => -792,"初めて" ...
my %TW3 = ("いただ" => -1734,"してい" => 1314,"として" => -4314,"につい" => -5483,"にとっ" => -5989,"に当た" => -6247,"ので," => -727,"ので、" => -727,"のもの" => -600,"れから" => -3752,"十二月" => -2287);
my %TW4 = ("いう." => 8576,"いう。" => 8576,"からな" => -2348,"してい" => 2958,"たが," => 1516,"たが、" => 1516,"ている" => 1538,"という" => 1349,"ました" => 5543,"ません" => 1097,"ようと" => -4258,"よると" => 5865);
view all matches for this distribution
view release on metacpan or search on metacpan
share/en_US/technology/information view on Meta::CPAN
BMP Basic Multilingual Plane
BNC Bayonet Neill-Concelman
BOFH bastard operator from hell
BOHICA bend over here it comes again
BOINC Berkeley Open Infrastructure for Network Computing
BOM Byte Order Mark
BOOTP Bootstrap Protocol
BPDU Bridge Protocol Data Unit
BPEL Business Process Execution Language
BPL Broadband over Power Lines
BPS bits per second
view all matches for this distribution
view release on metacpan or search on metacpan
t/encoding.t view on Meta::CPAN
env_compare [qw(utf8 c)] => 'ascii is fine',
sub { string("( hi )\n") }, tvc_html('hi');
# FIXME: These don't work on a lot of smokers (particularly FreeBSD),
# (even ones with vim 7.2 +multi_byte).
# We could do a simple check to see if BOM recognition works as we expect it
# and only then perform this test, because really we only want to test that
# we haven't broken this functionality in environments where it already worked.
# Aside from lots of failing reports, see also rt-92601.
TODO: {
local $TODO = 'Do simpler pre-tests to determine if these tests should pass in this environment.';
env_compare utf8 => 'use BOM to get vim to honor encoded text',
sub { prepend_bom($filetype, $input) }, $html;
env_compare utf8 => 'specify encoding by adding "+set fenc=..." to vim_options',
sub { pass_vim_options(undef, $input, {filetype => $filetype}) }, $html;
t/encoding.t view on Meta::CPAN
my ($lang, $str) = @_;
# MORITZ/App-Mowyw-v0.7.1/lib/App/Mowyw.pm#L566
{
# any encoding will do if vim automatically detects it
my $vim_encoding = 'utf-8';
my $BOM = "\x{feff}";
my $syn = Text::VimColor->new(
filetype => $lang,
string => encode($vim_encoding, $BOM . $str),
);
$str = decode($vim_encoding, $syn->html);
$str =~ s/^$BOM//;
return $str;
};
}
sub pass_vim_options {
view all matches for this distribution
view release on metacpan or search on metacpan
VisualWidth.xs view on Meta::CPAN
int count_single_char_utf8( const unsigned char** pos, int* byte ){
*byte = 0;
if( **pos == 0 ) return 0;
if( **pos == 0xef && *((*pos)+1) == 0xbb && *((*pos)+2) == 0xbf ){
// BOM
(*pos)+= 3;
(*byte)+= 3;
// printf("BOM\n");
return 0;
} else if( ( **pos & 0xe0 ) == 0xc0 && ( ( *((*pos)+1) & 0xc0 ) == 0x80 ) ){
(*pos)+= 2;
(*byte)+= 2;
// printf("2byte\n");
view all matches for this distribution