BOM results from the CPAN

Text-CSV

view release on metacpan or search on metacpan


1.97  2018-08-17
    - Fix/add minimum perl version (GH-38, Kivanc Yazan++)
    - Updated MANIFEST

1.96  2018-08-14
    - Imported tests/fixes from Text::CSV_XS 1.36
      - Added undef_str and keep_headers attributes
      - Added csv(out => \"skip")
      - Added formula actions
      - Fixed BOM issues
    - Fixed internal cache handling
    - Added license and preferred issue tracker to META files (GH#26, garu++)

1.95  2017-04-27
    - import "strict" attribute introduced in Text::CSV_XS 1.29

1.94  2017-04-11
    - Fix 5.6.2 issues

1.93  2017-04-04

lib/Text/CSV.pm view on Meta::CPAN

     }

The old(er) way of using global file handles is still supported

 while (my $row = $csv->getline (*ARGV)) { ... }

=head2 Unicode

Unicode is only tested to work with perl-5.8.2 and up.

See also L</BOM>.

The simplest way to ensure the correct encoding is used for  in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.

 open my $fh, "<:encoding(UTF-8)", "in.csv"  or die "in.csv: $!";
or
 my $aoa = csv (in => "in.csv",     encoding => "UTF-8");

 open my $fh, ">:encoding(UTF-8)", "out.csv" or die "out.csv: $!";

lib/Text/CSV.pm view on Meta::CPAN

     });

 $csv = Text::CSV::Encoded->new ({ encoding  => "utf8" });
 # combine () and print () accept *literally* utf8 encoded data
 # parse () and getline () return *literally* utf8 encoded data

 $csv = Text::CSV::Encoded->new ({ encoding  => undef }); # default
 # combine () and print () accept UTF8 marked data
 # parse () and getline () return UTF8 marked data

=head2 BOM

BOM  (or Byte Order Mark)  handling is available only inside the L</header>
method.   This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.

If a file has a BOM, the easiest way to deal with that is

 my $aoh = csv (in => $file, detect_bom => 1);

All records will be encoded based on the detected BOM.

This implies a call to the  L</header>  method,  which defaults to also set
the L</column_names>. So this is B<not> the same as

 my $aoh = csv (in => $file, headers => "auto");

which only reads the first record to set  L</column_names>  but ignores any
meaning of possible present BOM.

=head1 METHODS

This section is also taken from Text::CSV_XS.

=head2 version

(Class method) Returns the current module version.

=head2 new

lib/Text/CSV.pm view on Meta::CPAN


 $csv->header ($fh, [ ";", ",", "|", "\t", "::", "\x{2063}" ]);

Multi-byte  sequences are allowed,  both multi-character and  Unicode.  See
L<C<sep>|/sep>.

=item detect_bom

 $csv->header ($fh, { detect_bom => 1 });

The default behavior is to detect if the header line starts with a BOM.  If
the header has a BOM, use that to set the encoding of C<$fh>.  This default
behavior can be disabled by passing a false value to C<detect_bom>.

Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE,  and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1,  and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.

If a supported BOM was detected as start of the stream, it is stored in the
object attribute C<ENCODING>.

 my $enc = $csv->{ENCODING};

The encoding is used with C<binmode> on C<$fh>.

If the handle was opened in a (correct) encoding,  this method will  B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.

=item munge_column_names

This option offers the means to modify the column names into something that
is most useful to the application.   The default is to map all column names
to lower case.

 $csv->header ($fh, { munge_column_names => "lc" });

lib/Text/CSV.pm view on Meta::CPAN

Note that this is work in progress and things might change.

=head3 encoding

If passed,  it should be an encoding accepted by the  C<:encoding()> option
to C<open>. There is no default value. This attribute does not work in perl
5.6.x.  C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.

If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly.   This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.

Encodings can be stacked, as supported by C<binmode>:

 # Using PerlIO::via::gzip
 csv (in       => \@csv,
      out      => "test.csv:via.gz",
      encoding => ":via(gzip):encoding(utf-8)",
      );

lib/Text/CSV.pm view on Meta::CPAN

 # Using PerlIO::gzip
 csv (in       => \@csv,
      out      => "test.csv:via.gz",
      encoding => ":gzip:encoding(utf-8)",
      );
 $aoa = csv (in => "test.csv:gzip.gz", encoding => ":gzip");

=head3 detect_bom

If  C<detect_bom>  is given, the method  L</header>  will be invoked on the
opened stream to check if there is a BOM and set the encoding accordingly.

C<detect_bom> can be abbreviated to C<bom>.

This is the same as setting L<C<encoding>|/encoding> to C<"auto">.

Note that as the method  L</header> is invoked,  its default is to also set
the headers.

=head3 headers

lib/Text/CSV_PP.pm view on Meta::CPAN

     }

The old(er) way of using global file handles is still supported

 while (my $row = $csv->getline (*ARGV)) { ... }

=head2 Unicode

Unicode is only tested to work with perl-5.8.2 and up.

See also L</BOM>.

The simplest way to ensure the correct encoding is used for  in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.

 open my $fh, "<:encoding(UTF-8)", "in.csv"  or die "in.csv: $!";
or
 my $aoa = csv (in => "in.csv",     encoding => "UTF-8");

 open my $fh, ">:encoding(UTF-8)", "out.csv" or die "out.csv: $!";

lib/Text/CSV_PP.pm view on Meta::CPAN

     });

 $csv = Text::CSV::Encoded->new ({ encoding  => "utf8" });
 # combine () and print () accept *literally* utf8 encoded data
 # parse () and getline () return *literally* utf8 encoded data

 $csv = Text::CSV::Encoded->new ({ encoding  => undef }); # default
 # combine () and print () accept UTF8 marked data
 # parse () and getline () return UTF8 marked data

=head2 BOM

BOM  (or Byte Order Mark)  handling is available only inside the L</header>
method.   This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.

If a file has a BOM, the easiest way to deal with that is

 my $aoh = csv (in => $file, detect_bom => 1);

All records will be encoded based on the detected BOM.

This implies a call to the  L</header>  method,  which defaults to also set
the L</column_names>. So this is B<not> the same as

 my $aoh = csv (in => $file, headers => "auto");

which only reads the first record to set  L</column_names>  but ignores any
meaning of possible present BOM.

=head1 METHODS

This section is also taken from Text::CSV_XS.

=head2 version

(Class method) Returns the current module version.

=head2 new

lib/Text/CSV_PP.pm view on Meta::CPAN


 $csv->header ($fh, [ ";", ",", "|", "\t", "::", "\x{2063}" ]);

Multi-byte  sequences are allowed,  both multi-character and  Unicode.  See
L<C<sep>|/sep>.

=item detect_bom

 $csv->header ($fh, { detect_bom => 1 });

The default behavior is to detect if the header line starts with a BOM.  If
the header has a BOM, use that to set the encoding of C<$fh>.  This default
behavior can be disabled by passing a false value to C<detect_bom>.

Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE,  and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1,  and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.

If a supported BOM was detected as start of the stream, it is stored in the
object attribute C<ENCODING>.

 my $enc = $csv->{ENCODING};

The encoding is used with C<binmode> on C<$fh>.

If the handle was opened in a (correct) encoding,  this method will  B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.

=item munge_column_names

This option offers the means to modify the column names into something that
is most useful to the application.   The default is to map all column names
to lower case.

 $csv->header ($fh, { munge_column_names => "lc" });

lib/Text/CSV_PP.pm view on Meta::CPAN

Note that this is work in progress and things might change.

=head3 encoding

If passed,  it should be an encoding accepted by the  C<:encoding()> option
to C<open>. There is no default value. This attribute does not work in perl
5.6.x.  C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.

If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly.   This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.

Encodings can be stacked, as supported by C<binmode>:

 # Using PerlIO::via::gzip
 csv (in       => \@csv,
      out      => "test.csv:via.gz",
      encoding => ":via(gzip):encoding(utf-8)",
      );

lib/Text/CSV_PP.pm view on Meta::CPAN

 # Using PerlIO::gzip
 csv (in       => \@csv,
      out      => "test.csv:via.gz",
      encoding => ":gzip:encoding(utf-8)",
      );
 $aoa = csv (in => "test.csv:gzip.gz", encoding => ":gzip");

=head3 detect_bom

If  C<detect_bom>  is given, the method  L</header>  will be invoked on the
opened stream to check if there is a BOM and set the encoding accordingly.

C<detect_bom> can be abbreviated to C<bom>.

This is the same as setting L<C<encoding>|/encoding> to C<"auto">.

Note that as the method  L</header> is invoked,  its default is to also set
the headers.

=head3 headers

t/85_util.t view on Meta::CPAN

	    #$ebcdic and $has_enc = 0; # TODO

	    $csv = Text::CSV->new ({ binary => 1, auto_diag => 9 });

	    SKIP: {
		$has_enc or skip "Encoding $enc not supported", $enc =~ m/^utf/ ? 10 : 9;
		$csv->column_names (undef);
		open my $fh, "<", $fnm;
		binmode $fh;
		ok (1, "$fnm opened for enc $enc");
		ok ($csv->header ($fh), "headers with BOM for $enc");
		$enc =~ m/^utf/ and is ($csv->{ENCODING}, uc $enc, "Encoding inquirable");

		is (($csv->column_names)[1], "b${a_ring}r", "column name was decoded");
		ok (my $row = $csv->getline_hr ($fh), "getline_hr");
		is ($row->{"b${a_ring}r"}, "1 \x{20ac} each", "Returned in Unicode");
		close $fh;

		my $aoh;
		ok ($aoh = csv (in => $fnm, bom => 1), "csv (bom => 1)");
		is_deeply ($aoh,

t/85_util.t view on Meta::CPAN

		is_deeply ($aoh,
		    [{ zoo => 1, "b${a_ring}r" => "1 \x{20ac} each" }], "Returned data auto");
		}

	    SKIP: {
		$has_enc or skip "Encoding $enc not supported", 7;
		$csv->column_names (undef);
		open my $fh, "<", $fnm;
		$enc eq "none" or binmode $fh, ":encoding($enc)";
		ok (1, "$fnm opened for enc $enc");
		ok ($csv->header ($fh), "headers with BOM for $enc");
		is (($csv->column_names)[1], "b${a_ring}r", "column name was decoded");
		ok (my $row = $csv->getline_hr ($fh), "getline_hr");
		is ($row->{"b${a_ring}r"}, "1 \x{20ac} each", "Returned in Unicode");
		close $fh;

		ok (my $aoh = csv (in => $fnm, bom => 1), "csv (bom => 1)");
		is_deeply ($aoh,
		    [{ zoo => 1, "b${a_ring}r" => "1 \x{20ac} each" }], "Returned data");
		}

t/91_csv_cb.t view on Meta::CPAN

		[ { FOO => 1, BAR => 2,     BAZ => 3 },
		  { FOO => 2, BAR => "a b", BAZ => "" }],
		"AOH with lc headers");
    is_deeply (csv (in => $tfn, headers => sub { lcfirst uc $_[0] }),
		[ { fOO => 1, bAR => 2,     bAZ => 3 },
		  { fOO => 2, bAR => "a b", bAZ => "" }],
		"AOH with mangled headers");
    }

SKIP: {
    $] < 5.008001 and skip "No BOM support in $]", 1;
    is_deeply (csv (in => $tfn, munge => { bar => "boo" }),
	[{ baz =>  3, boo => 2,     foo => 1 },
	 { baz => "", boo => "a b", foo => 2 }], "Munge with hash");
    }

open  $fh, ">>", $tfn or die "$tfn: $!";
print $fh <<"EOD";
3,3,3
4,5,6
5,7,9

t/util.pl view on Meta::CPAN

   -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,  # 8
   -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,  # 9
   -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,  # A
   -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,  # B
   -1,-1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # C
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # D
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,  # E
    4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7,13,  # F
    ) : ();

# Used for BOM testing
*byte_utf8a_to_utf8n = $ebcdic ? sub {
    # Convert a UTF-8 byte sequence into the platform's native UTF-8
    # equivalent, currently only UTF-8 and UTF-EBCDIC.

    my $string = shift;
    utf8::is_utf8 ($string) and return $string;

    my $length = length $string;
    #diag ($string);
    #diag ($length);

( run in 0.384 second using v1.01-cache-2.11-cpan-131fc08a04b )