Text-CSV_XS
The old(er) way of using global file handles is still supported
while (my $row = $csv->getline (*ARGV)) { ... }
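With lexical filehandles, the equivalent reads as follows (a minimal sketch; the file name and attributes are illustrative, not prescribed by the module):

```perl
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });

# A lexical filehandle stays scoped to the enclosing block
open my $fh, "<", "file.csv" or die "file.csv: $!";
while (my $row = $csv->getline ($fh)) {
    # $row is an array ref holding the fields of one record
    print "@$row\n";
    }
close $fh;
```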
=head2 Unicode
Unicode is only tested to work with perl-5.8.2 and up.
See also L</BOM>.
The simplest way to ensure the correct encoding is used for in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.
open my $fh, "<:encoding(UTF-8)", "in.csv" or die "in.csv: $!";
or
my $aoa = csv (in => "in.csv", encoding => "UTF-8");
open my $fh, ">:encoding(UTF-8)", "out.csv" or die "out.csv: $!";
or
csv (in => $aoa, out => "out.csv", encoding => "UTF-8");
$csv = Text::CSV::Encoded->new ({ encoding => "utf8" });
# combine () and print () accept *literally* utf8 encoded data
# parse () and getline () return *literally* utf8 encoded data
$csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
# combine () and print () accept UTF8 marked data
# parse () and getline () return UTF8 marked data
=head2 BOM
BOM (or Byte Order Mark) handling is available only inside the L</header>
method. This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.
If a file has a BOM, the easiest way to deal with that is
my $aoh = csv (in => $file, detect_bom => 1);
All records will be decoded based on the detected BOM.
This implies a call to the L</header> method, which by default also sets
L</column_names>. So this is B<not> the same as
my $aoh = csv (in => $file, headers => "auto");
which only reads the first record to set L</column_names>, but ignores any
BOM that may be present.
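To make the contrast concrete, a sketch (assuming a hypothetical UTF-16LE file C<data.csv> that starts with a BOM):

```perl
use Text::CSV_XS qw( csv );

# detect_bom: reads the BOM, sets the matching encoding on the
# stream, and calls header () - which also sets column_names
my $aoh_bom  = csv (in => "data.csv", detect_bom => 1);

# headers => "auto": takes the first record as column names, but
# does not inspect the BOM, so the stream is not decoded and the
# BOM bytes stick to the first column name
my $aoh_auto = csv (in => "data.csv", headers => "auto");
```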
=head1 SPECIFICATION
While no formal specification for CSV exists, L<RFC 4180|https://datatracker.ietf.org/doc/html/rfc4180>
(I<1>) describes the common format and establishes C<text/csv> as the MIME
type registered with the IANA. L<RFC 7111|https://datatracker.ietf.org/doc/html/rfc7111>
(I<2>) adds fragments to CSV.
Many informal documents exist that describe the C<CSV> format. L<"How To:
The Comma Separated Value (CSV) File Format"|http://creativyst.com/Doc/Articles/CSV/CSV01.shtml>
(I<3>) provides an overview of the C<CSV> format in the most widely used
applications and explains how it can best be used and supported.
$csv->header ($fh, [ ";", ",", "|", "\t", "::", "\x{2063}" ]);
Multi-byte sequences are allowed, both multi-character and Unicode. See
L<C<sep>|/sep>.
=item detect_bom
X<detect_bom>
$csv->header ($fh, { detect_bom => 1 });
The default behavior is to detect if the header line starts with a BOM. If
the header has a BOM, use that to set the encoding of C<$fh>. This default
behavior can be disabled by passing a false value to C<detect_bom>.
Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1, and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.
If a supported BOM was detected as start of the stream, it is stored in the
object attribute C<ENCODING>.
my $enc = $csv->{ENCODING};
The encoding is used with C<binmode> on C<$fh>.
If the handle was opened in a (correct) encoding, this method will B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.
=item munge_column_names
X<munge_column_names>
This option offers the means to modify the column names into something that
is most useful to the application. The default is to map all column names
to lower case.
$csv->header ($fh, { munge_column_names => "lc" });
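Besides the default C<"lc">, the option also accepts C<"uc">, C<"none">, a hashref mapping old names to new ones, or a coderef that gets each header name in C<$_> (a sketch continuing the example above; verify the coderef calling convention against your installed version):

```perl
$csv->header ($fh, { munge_column_names => "uc"   });   # upper case
$csv->header ($fh, { munge_column_names => "none" });   # keep as-is

# fold to lower case and squash non-word runs to "_"
$csv->header ($fh, { munge_column_names => sub { lc (s/\W+/_/gr) } });
```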
=head3 encoding
X<encoding>
If passed, it should be an encoding accepted by the C<:encoding()> option
to C<open>. There is no default value. This attribute does not work in perl
5.6.x. C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.
If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly. This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.
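In other words, these two calls behave identically (a sketch; C<file.csv> is illustrative):

```perl
use Text::CSV_XS qw( csv );

my $aoh1 = csv (in => "file.csv", encoding   => "auto");
my $aoh2 = csv (in => "file.csv", detect_bom => 1);
```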
Encodings can be stacked, as supported by C<binmode>:
# Using PerlIO::via::gzip
csv (in       => \@csv,
     out      => "test.csv:via.gz",
     encoding => ":via(gzip):encoding(utf-8)",
     );
$aoa = csv (in => "test.csv:via.gz", encoding => ":via(gzip)");

# Using PerlIO::gzip
csv (in       => \@csv,
     out      => "test.csv:gzip.gz",
     encoding => ":gzip:encoding(utf-8)",
     );
$aoa = csv (in => "test.csv:gzip.gz", encoding => ":gzip");
=head3 detect_bom
X<detect_bom>
If C<detect_bom> is given, the method L</header> will be invoked on the
opened stream to check if there is a BOM and set the encoding accordingly.
Note that the attribute L<C<headers>|/headers> can be used to overrule the
default behavior of how that method automatically sets the attribute.
C<detect_bom> can be abbreviated to C<bom>.
This is the same as setting L<C<encoding>|/encoding> to C<"auto">.
=head3 headers
X<headers>
=head2 Rewriting CSV
=head3 Changing separator
Rewrite C<CSV> files with C<;> as separator character to well-formed C<CSV>:
use Text::CSV_XS qw( csv );
csv (in => csv (in => "bad.csv", sep_char => ";"), out => *STDOUT);
As C<STDOUT> is now default in L</csv>, a one-liner converting a UTF-16 CSV
file with BOM and TAB-separation to valid UTF-8 CSV could be:
$ perl -C3 -MText::CSV_XS=csv -we\
'csv(in=>"utf16tab.csv",encoding=>"utf16",sep=>"\t")' >utf8.csv
=head3 Unifying EOL
Rewrite a CSV file with mixed EOL and/or inconsistent quotation into a new
CSV file with consistent EOL and quotation. Attributes apply.
use Text::CSV_XS qw( csv );
$ csvdiff --html --output=diff.html file1.csv file2.csv
=item rewrite.pl
X<rewrite.pl>
A script to rewrite (in)valid CSV into valid CSV files. The script has
options to generate confusing CSV files or CSV files that conform to Dutch
MS-Excel exports (using C<;> as separator).
By default the script honors a BOM and auto-detects the separator,
converting the input to standard CSV with C<,> as separator.
=back
=head1 CAVEATS
Text::CSV_XS is I<not> designed to detect the characters used to quote and
separate fields. The parsing is done using predefined (default) settings.
In the examples sub-directory, you can find scripts that demonstrate how
you could try to detect these characters yourself.
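As an illustration only (this helper is not part of Text::CSV_XS), one naive detection heuristic counts candidate separator characters in a few sample lines and picks the most frequent; note that quoted fields containing those characters will skew the counts:

```perl
use strict;
use warnings;

# Guess the separator by counting candidate characters in the
# given sample lines; returns the most frequent candidate.
sub guess_sep {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        $count{$_} += () = $line =~ m/\Q$_\E/g for (",", ";", "\t", "|");
        }
    my ($sep) = sort { $count{$b} <=> $count{$a} } keys %count;
    return $sep;
    }

print guess_sep ("a;b;c", "1;2,5;3"), "\n";   # prints ";"
```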
* Make detect_bom result available
* It's 2018
* Add csv (out => \"skip") - suppress output deliberately
* Allow sub as top-level filter
* Tested against Test2::Harness-0.001062 (yath test)
* Tested against perl-5.27.10
1.34 - 2017-11-05, H.Merijn Brand
* Bad arg for formula (like "craok") will now die with error 1500
* Row report in formula reporting was off by 1
* Add a prominent section about BOM handling
* Make sheet label more portable (csv2xlsx)
* Allow munge => \%hash
* Preserve first row in csv (set_column_names => 0)
1.33 - 2017-10-19, H.Merijn Brand
* Small additional fix for eol = \r + BOM
* Updated doc for example files
* Add support for formula actions (issue 11)
- csv2xls and csv2xlsx now warn by default
* Reset file info on ->header call (RT#123320)
1.32 - 2017-09-15, H.Merijn Brand
* Add keep_headers attribute to csv ()
* Fix on_in when used in combination with key
* Fail on invalid arguments to csv
* Fix header method on EOL = CR (RT#122764)
1.31 - 2017-06-13, H.Merijn Brand
* Fix already decoded BOM in headers
* New options in csv-check
* Some perlcritic
* "escape" is alias for "escape_char" for consistency.
* Code cleanup and more tests (Devel::Cover)
* Improve csv-check auto-sep-detection
1.30 - 2017-06-08, H.Merijn Brand
* Add csv (..., out => ...) syntax examples (issue 7)
* Disable escape_null for undefined escape_char
* Fix ->say for bound columns (RT#121576)
examples/csv-check
the string "tab" is allowed.
-q Q --quo=Q use Q as quote char. Auto-detect, default = '"'
the string "undef" will disable quotation.
-e E --esc=E use E as escape char. Auto-detect, default = '"'
the string "undef" will disable escapes.
-N --nl force EOL to \\n
-C --cr force EOL to \\r
-M --crnl force EOL to \\r\\n
-u --utf-8 check if all fields are valid unicode
-E E --enc=E open file with encoding E
-h --hdr check with header (implies BOM)
-b --bom check with BOM (no header)
-f --skip-formula do not check formulas
--pp use Text::CSV_PP instead (cross-check)
-A a --attr=at:val pass attributes to parser
--at=val is also supported for known attributes
-L --list-attr list supported CSV attributes
-X --list-changes list attributes that changed from default
EOU
exit $err;
examples/rewrite.pl
sub usage {
my $err = shift and select STDERR;
print <<"EOH";
usage: $0 [-o file] [-s S] [-m] [-c] [-i] [file]
-o F --out=F output to file F (default STDOUT)
-s S --sep=S set input separator to S (default ; , TAB or |)
-m --ms output Dutch style MicroSoft CSV (; and \\r\\n)
-n --no-header CSV has no header line. If selected
- default input sep = ;
- BOM is not used/recognized
-c --confuse Use confusing separation and quoting characters
-i --invisible Use invisible separation and quotation sequences
EOH
exit $err;
} # usage
use Getopt::Long qw(:config bundling);
GetOptions (
"help|?" => sub { usage (0); },
"s|sep=s" => \my $in_sep,
BmPREVIOUS|5.003007||Viu
BmRARE|5.003007||Viu
BmUSEFUL|5.003007||Viu
BOL|5.003007||Viu
BOL_t8|5.035004||Viu
BOL_t8_p8|5.033003||Viu
BOL_t8_pb|5.033003||Viu
BOL_tb|5.035004||Viu
BOL_tb_p8|5.033003||Viu
BOL_tb_pb|5.033003||Viu
BOM_UTF8|5.025005|5.003007|p
BOM_UTF8_FIRST_BYTE|5.019004||Viu
BOM_UTF8_TAIL|5.019004||Viu
boolSV|5.004000|5.003007|p
boot_core_builtin|5.035007||Viu
boot_core_mro|5.009005||Viu
boot_core_PerlIO|5.007002||Viu
boot_core_UNIVERSAL|5.003007||Viu
BOUND|5.003007||Viu
BOUNDA|5.013009||Viu
BOUNDA_t8|5.035004||Viu
BOUNDA_t8_p8|5.033003||Viu
BOUNDA_t8_pb|5.033003||Viu
#ifndef isUTF8_CHAR
# define isUTF8_CHAR(s, e) ( \
(e) <= (s) || ! is_utf8_string(s, UTF8_SAFE_SKIP(s, e)) \
? 0 \
: UTF8SKIP(s))
#endif
#endif
#if 'A' == 65
#ifndef BOM_UTF8
# define BOM_UTF8 "\xEF\xBB\xBF"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
# define REPLACEMENT_CHARACTER_UTF8 "\xEF\xBF\xBD"
#endif
#elif '^' == 95
#ifndef BOM_UTF8
# define BOM_UTF8 "\xDD\x73\x66\x73"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
# define REPLACEMENT_CHARACTER_UTF8 "\xDD\x73\x73\x71"
#endif
#elif '^' == 176
#ifndef BOM_UTF8
# define BOM_UTF8 "\xDD\x72\x65\x72"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
# define REPLACEMENT_CHARACTER_UTF8 "\xDD\x72\x72\x70"
#endif
#else
# error Unknown character set
#endif
t/85_util.t
#$ebcdic and $has_enc = 0; # TODO
$csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 9 });
SKIP: {
$has_enc or skip "Encoding $enc not supported", $enc =~ m/^utf/ ? 10 : 9;
$csv->column_names (undef);
open my $fh, "<", $fnm;
binmode $fh;
ok (1, "$fnm opened for enc $enc");
ok ($csv->header ($fh), "headers with BOM for $enc");
$enc =~ m/^utf/ and is ($csv->{ENCODING}, uc $enc, "Encoding inquirable");
is (($csv->column_names)[1], "b${a_ring}r", "column name was decoded");
ok (my $row = $csv->getline_hr ($fh), "getline_hr");
is ($row->{"b${a_ring}r"}, "1 \x{20ac} each", "Returned in Unicode");
close $fh;
my $aoh;
ok ($aoh = csv (in => $fnm, bom => 1), "csv (bom => 1)");
is_deeply ($aoh,
[{ zoo => 1, "b${a_ring}r" => "1 \x{20ac} each" }], "Returned data auto");
}
SKIP: {
$has_enc or skip "Encoding $enc not supported", 7;
$csv->column_names (undef);
open my $fh, "<", $fnm;
$enc eq "none" or binmode $fh, ":encoding($enc)";
ok (1, "$fnm opened for enc $enc");
ok ($csv->header ($fh), "headers with BOM for $enc");
is (($csv->column_names)[1], "b${a_ring}r", "column name was decoded");
ok (my $row = $csv->getline_hr ($fh), "getline_hr");
is ($row->{"b${a_ring}r"}, "1 \x{20ac} each", "Returned in Unicode");
close $fh;
ok (my $aoh = csv (in => $fnm, bom => 1), "csv (bom => 1)");
is_deeply ($aoh,
[{ zoo => 1, "b${a_ring}r" => "1 \x{20ac} each" }], "Returned data");
}
t/91_csv_cb.t
[ { FOO => 1, BAR => 2, BAZ => 3 },
{ FOO => 2, BAR => "a b", BAZ => "" }],
"AOH with lc headers");
is_deeply (csv (in => $tfn, headers => sub { lcfirst uc $_[0] }),
[ { fOO => 1, bAR => 2, bAZ => 3 },
{ fOO => 2, bAR => "a b", bAZ => "" }],
"AOH with mangled headers");
}
SKIP: {
$] < 5.008001 and skip "No BOM support in $]", 1;
is_deeply (csv (in => $tfn, munge => { bar => "boo" }),
[{ baz => 3, boo => 2, foo => 1 },
{ baz => "", boo => "a b", foo => 2 }], "Munge with hash");
}
open $fh, ">>", $tfn or die "$tfn: $!";
print $fh <<"EOD";
3,3,3
4,5,6
5,7,9
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, # 8
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, # 9
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, # A
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, # B
-1,-1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, # C
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, # D
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, # E
4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7,13, # F
) : ();
# Used for BOM testing
*byte_utf8a_to_utf8n = $ebcdic ? sub {
# Convert a UTF-8 byte sequence into the platform's native UTF-8
# equivalent, currently only UTF-8 and UTF-EBCDIC.
my $string = shift;
utf8::is_utf8 ($string) and return $string;
my $length = length $string;
#diag ($string);
#diag ($length);