BOM results from the CPAN

Text-CSV_XS

view release on metacpan or search on metacpan

 # Read/parse CSV
 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
 open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
 while (my $row = $csv->getline ($fh)) {
     $row->[2] =~ m/pattern/ or next; # 3rd field should match
     push @rows, $row;
     }
 close $fh;

 # and write as CSV
 open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
 $csv->say ($fh, $_) for @rows;
 close $fh or die "new.csv: $!";

=head1 DESCRIPTION

Text::CSV_XS  provides facilities for the composition  and decomposition of
comma-separated values.  An instance of the Text::CSV_XS class will combine
fields into a C<CSV> string and parse a C<CSV> string into fields.

The module accepts either strings or files as input  and support the use of
user-specified characters for delimiters, separators, and escapes.

=head2 Embedded newlines

B<Important Note>:  The default behavior is to accept only ASCII characters
in the range from C<0x20> (space) to C<0x7E> (tilde).   This means that the
fields can not contain newlines. If your data contains newlines embedded in
fields, or characters above C<0x7E> (tilde), or binary data, you B<I<must>>
set C<< binary => 1 >> in the call to L</new>. To cover the widest range of
parsing options, you will always want to set binary.

But you still have the problem  that you have to pass a correct line to the
L</parse> method, which is more complicated from the usual point of usage:

 my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
 while (<>) {		#  WRONG!
     $csv->parse ($_);
     my @fields = $csv->fields ();
     }

this will break, as the C<while> might read broken lines:  it does not care
about the quoting. If you need to support embedded newlines,  the way to go
is to  B<not>  pass L<C<eol>|/eol> in the parser  (it accepts C<\n>, C<\r>,
B<and> C<\r\n> by default) and then

 my $csv = Text::CSV_XS->new ({ binary => 1 });
 open my $fh, "<", $file or die "$file: $!";
 while (my $row = $csv->getline ($fh)) {
     my @fields = @$row;
     }

The old(er) way of using global file handles is still supported

 while (my $row = $csv->getline (*ARGV)) { ... }

=head2 Unicode

Unicode is only tested to work with perl-5.8.2 and up.

See also L</BOM>.

The simplest way to ensure the correct encoding is used for  in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.

 open my $fh, "<:encoding(UTF-8)", "in.csv"  or die "in.csv: $!";
or
 my $aoa = csv (in => "in.csv",     encoding => "UTF-8");

 open my $fh, ">:encoding(UTF-8)", "out.csv" or die "out.csv: $!";
or
 csv (in => $aoa, out => "out.csv", encoding => "UTF-8");

On parsing (both for  L</getline> and  L</parse>),  if the source is marked
being UTF8, then all fields that are marked binary will also be marked UTF8.

On combining (L</print>  and  L</combine>):  if any of the combining fields
was marked UTF8, the resulting string will be marked as UTF8.  Note however
that all fields  I<before>  the first field marked UTF8 and contained 8-bit
characters that were not upgraded to UTF8,  these will be  C<bytes>  in the
resulting string too, possibly causing unexpected errors.  If you pass data
of different encoding,  or you don't know if there is  different  encoding,
force it to be upgraded before you pass them on:

 $csv->print ($fh, [ map { utf8::upgrade (my $x = $_); $x } @data ]);

For complete control over encoding, please use L<Text::CSV::Encoded>:

 use Text::CSV::Encoded;
 my $csv = Text::CSV::Encoded->new ({
     encoding_in  => "iso-8859-1", # the encoding comes into   Perl
     encoding_out => "cp1252",     # the encoding comes out of Perl
     });

 $csv = Text::CSV::Encoded->new ({ encoding  => "utf8" });
 # combine () and print () accept *literally* utf8 encoded data
 # parse () and getline () return *literally* utf8 encoded data

 $csv = Text::CSV::Encoded->new ({ encoding  => undef }); # default
 # combine () and print () accept UTF8 marked data
 # parse () and getline () return UTF8 marked data

=head2 BOM

BOM  (or Byte Order Mark)  handling is available only inside the L</header>
method.   This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.

If a file has a BOM, the easiest way to deal with that is

 my $aoh = csv (in => $file, detect_bom => 1);

All records will be encoded based on the detected BOM.

This implies a call to the  L</header>  method,  which defaults to also set
the L</column_names>. So this is B<not> the same as

 my $aoh = csv (in => $file, headers => "auto");

which only reads the first record to set  L</column_names>  but ignores any
meaning of possible present BOM.

=head1 SPECIFICATION

While no formal specification for CSV exists, L<RFC 4180|https://datatracker.ietf.org/doc/html/rfc4180>
(I<1>) describes the common format and establishes  C<text/csv> as the MIME
type registered with the IANA. L<RFC 7111|https://datatracker.ietf.org/doc/html/rfc7111>
(I<2>) adds fragments to CSV.

Many informal documents exist that describe the C<CSV> format.   L<"How To:
The Comma Separated Value (CSV) File Format"|http://creativyst.com/Doc/Articles/CSV/CSV01.shtml>
(I<3>)  provides an overview of the  C<CSV>  format in the most widely used
applications and explains how it can best be used and supported.

 1) https://datatracker.ietf.org/doc/html/rfc4180
 2) https://datatracker.ietf.org/doc/html/rfc7111
 3) http://creativyst.com/Doc/Articles/CSV/CSV01.shtml

The basic rules are as follows:

B<CSV>  is a delimited data format that has fields/columns separated by the
comma character and records/rows separated by newlines. Fields that contain
a special character (comma, newline, or double quote),  must be enclosed in
double quotes. However, if a line contains a single entry that is the empty
string, it may be enclosed in double quotes.  If a field's value contains a
double quote character it is escaped by placing another double quote
character next to it. The C<CSV> file format does not require a specific
character encoding, byte order, or line terminator format.

=over 2

=item *

Each record is a single line ended by a line feed  (ASCII/C<LF>=C<0x0A>) or
a carriage return and line feed pair (ASCII/C<CRLF>=C<0x0D 0x0A>), however,
line-breaks may be embedded.

=item *

Fields are separated by commas.

=item *

Allowable characters within a C<CSV> field include C<0x09> (C<TAB>) and the
inclusive range of C<0x20> (space) through C<0x7E> (tilde).  In binary mode
all characters are accepted, at least in quoted fields.

=item *

A field within  C<CSV>  must be surrounded by  double-quotes to  contain  a
separator character (comma).

=back

Though this is the most clear and restrictive definition,  Text::CSV_XS  is
way more liberal than this, and allows extension:

=over 2

=item *

CSV_XS.pm view on Meta::CPAN

line from the open handle then auto-detect whether the header separates the
column names with a character from the allowed separator list.

If any of the allowed separators matches,  and none of the I<other> allowed
separators match,  set  L<C<sep>|/sep>  to that  separator  for the current
CSV_XS instance and use it to parse the first line, map those to lowercase,
and use that to set the instance L</column_names>:

 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
 open my $fh, "<", "file.csv";
 binmode $fh; # for Windows
 $csv->header ($fh);
 while (my $row = $csv->getline_hr ($fh)) {
     ...
     }

If the header is empty,  contains more than one unique separator out of the
allowed set,  contains empty fields,   or contains identical fields  (after
folding), it will croak with error 1010, 1011, 1012, or 1013 respectively.

If the header contains embedded newlines or is not valid  CSV  in any other
way, this method will croak and leave the parse error untouched.

A successful call to C<header>  will always set the  L<C<sep>|/sep>  of the
C<$csv> object. This behavior can not be disabled.

=head3 return value

On error this method will croak.

In list context,  the headers will be returned whether they are used to set
L</column_names> or not.

In scalar context, the instance itself is returned.  B<Note>: the values as
found in the header will effectively be  B<lost> if  C<set_column_names> is
false.

=head3 Options

=over 2

=item sep_set
X<sep_set>

 $csv->header ($fh, { sep_set => [ ";", ",", "|", "\t" ] });

The list of legal separators defaults to C<[ ";", "," ]> and can be changed
by this option.  As this is probably the most often used option,  it can be
passed on its own as an unnamed argument:

 $csv->header ($fh, [ ";", ",", "|", "\t", "::", "\x{2063}" ]);

Multi-byte  sequences are allowed,  both multi-character and  Unicode.  See
L<C<sep>|/sep>.

=item detect_bom
X<detect_bom>

 $csv->header ($fh, { detect_bom => 1 });

The default behavior is to detect if the header line starts with a BOM.  If
the header has a BOM, use that to set the encoding of C<$fh>.  This default
behavior can be disabled by passing a false value to C<detect_bom>.

Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE,  and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1,  and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.

If a supported BOM was detected as start of the stream, it is stored in the
object attribute C<ENCODING>.

 my $enc = $csv->{ENCODING};

The encoding is used with C<binmode> on C<$fh>.

If the handle was opened in a (correct) encoding,  this method will  B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.

=item munge_column_names
X<munge_column_names>

This option offers the means to modify the column names into something that
is most useful to the application.   The default is to map all column names
to lower case.

 $csv->header ($fh, { munge_column_names => "lc" });

The following values are available:

  lc     - lower case
  uc     - upper case
  db     - valid DB field names
  none   - do not change
  \%hash - supply a mapping
  \&cb   - supply a callback

=over 2

=item Lower case

 $csv->header ($fh, { munge_column_names => "lc" });

The header is changed to all lower-case

 $_ = lc;

=item Upper case

 $csv->header ($fh, { munge_column_names => "uc" });

The header is changed to all upper-case

 $_ = uc;

=item Literal

 $csv->header ($fh, { munge_column_names => "none" });

=item Hash

 $csv->header ($fh, { munge_column_names => { foo => "sombrero" });

if a value does not exist, the original value is used unchanged

=item Database

 $csv->header ($fh, { munge_column_names => "db" });

=over 2

=item -

lower-case

=item -

CSV_XS.pm view on Meta::CPAN

If C<out> is set to a reference of the literal string C<"skip">, the output
will be suppressed completely,  which might be useful in combination with a
filter for side effects only.

 my %cache;
 csv (in    => "dump.csv",
      out   => \"skip",
      on_in => sub { $cache{$_[1][1]}++ });

Currently,  setting C<out> to any false value  (C<undef>, C<"">, 0) will be
equivalent to C<\"skip">.

If the C<in> argument point to something to parse, and the C<out> is set to
a reference to an C<ARRAY> or a C<HASH>, the output is appended to the data
in the existing reference. The result of the parse should match what exists
in the reference passed. This might come handy when you have to parse a set
of files with similar content (like data stored per period) and you want to
collect that into a single data structure:

 my %hash;
 csv (in => $_, out => \%hash, key => "id") for sort glob "foo-[0-9]*.csv";

 my @list; # List of arrays
 csv (in => $_, out => \@list)              for sort glob "foo-[0-9]*.csv";

 my @list; # List of hashes
 csv (in => $_, out => \@list, bom => 1)    for sort glob "foo-[0-9]*.csv";

=head4 Streaming
X<streaming>

If B<both> C<in> and C<out> are files, file handles or globs,  streaming is
enforced by injecting an C<after_parse> callback  that immediately uses the
L<C<say ()>|/say> method of the same instance to output the result and then
rejects the record.

If a C<after_parse> was already passed as attribute,  that will be included
in the injected call. If C<on_in> was passed and C<after_parse> was not, it
will be used instead. If both were passed, C<on_in> is ignored.

The EOL of the first record of the C<in> source is consistently used as EOL
for all records in the C<out> destination.

The C<filter> attribute is not available.

All other attributes are shared for C<in> and C<out>,  so you cannot define
different encodings for C<in> and C<out>.  You need to pass a C<$fh>, where
C<binmode> was used to apply the encoding layers.

Note that this is work in progress and things might change.

=head3 encoding
X<encoding>

If passed,  it should be an encoding accepted by the  C<:encoding()> option
to C<open>. There is no default value. This attribute does not work in perl
5.6.x.  C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.

If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly.   This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.

Encodings can be stacked, as supported by C<binmode>:

 # Using PerlIO::via::gzip
 csv (in       => \@csv,
      out      => "test.csv:via.gz",
      encoding => ":via(gzip):encoding(utf-8)",
      );
 $aoa = csv (in => "test.csv:via.gz",  encoding => ":via(gzip)");

 # Using PerlIO::gzip
 csv (in       => \@csv,
      out      => "test.csv:via.gz",
      encoding => ":gzip:encoding(utf-8)",
      );
 $aoa = csv (in => "test.csv:gzip.gz", encoding => ":gzip");

=head3 detect_bom
X<detect_bom>

If  C<detect_bom>  is given, the method  L</header>  will be invoked on the
opened stream to check if there is a  BOM and set the encoding accordingly.
Note that the attribute L<C<headers>|/headers>  can be used to overrule the
default behavior of how that method automatically sets the attribute.

C<detect_bom> can be abbreviated to C<bom>.

This is the same as setting L<C<encoding>|/encoding> to C<"auto">.

=head3 headers
X<headers>

If this attribute is not given, the default behavior is to produce an array
of arrays.

If C<headers> is supplied,  it should be an anonymous list of column names,
an anonymous hashref, a coderef, or a literal flag:  C<auto>, C<lc>, C<uc>,
or C<skip>.

=over 2

=item skip
X<skip>

When C<skip> is used, the header will not be included in the output.

 my $aoa = csv (in => $fh, headers => "skip");

C<skip> is invalid/ignored in combinations with L<C<detect_bom>|/detect_bom>.

=item auto
X<auto>

If C<auto> is used, the first line of the C<CSV> source will be read as the
list of field headers and used to produce an array of hashes.

 my $aoh = csv (in => $fh, headers => "auto");

=item lc
X<lc>

If C<lc> is used,  the first line of the  C<CSV> source will be read as the
list of field headers mapped to  lower case and used to produce an array of
hashes. This is a variation of C<auto>.

 my $aoh = csv (in => $fh, headers => "lc");

=item uc
X<uc>

If C<uc> is used,  the first line of the  C<CSV> source will be read as the
list of field headers mapped to  upper case and used to produce an array of
hashes. This is a variation of C<auto>.

 my $aoh = csv (in => $fh, headers => "uc");

=item CODE
X<CODE>

If a coderef is used,  the first line of the  C<CSV> source will be read as
the list of mangled field headers in which each field is passed as the only
argument to the coderef. This list is used to produce an array of hashes.

CSV_XS.pm view on Meta::CPAN


=head2 Printing CSV data

=head3 The fast way: using L</print>

An example for creating C<CSV> files using the L</print> method:

 my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
 open my $fh, ">", "foo.csv" or die "foo.csv: $!";
 for (1 .. 10) {
     $csv->print ($fh, [ $_, "$_" ]) or $csv->error_diag;
     }
 close $fh or die "$tbl.csv: $!";

=head3 The slow way: using L</combine> and L</string>

or using the slower L</combine> and L</string> methods:

 my $csv = Text::CSV_XS->new;

 open my $csv_fh, ">", "hello.csv" or die "hello.csv: $!";

 my @sample_input_fields = (
     'You said, "Hello!"',   5.67,
     '"Surely"',   '',   '3.14159');
 if ($csv->combine (@sample_input_fields)) {
     print $csv_fh $csv->string, "\n";
     }
 else {
     print "combine () failed on argument: ",
         $csv->error_input, "\n";
     }
 close $csv_fh or die "hello.csv: $!";

=head3 Generating CSV into memory

Format a data-set (C<@foo>) into a scalar value in memory (C<$data>):

 # The data
 my @foo = map { [ 0 .. 5 ] } 0 .. 3;

 # in a loop
 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, eol => "\r\n" });
 open my $fh, ">", \my $data;
 $csv->print ($fh, $_) for @foo;
 close $fh;

 # a single call
 csv (in => \@foo, out => \my $data);

=head2 Rewriting CSV

=head3 Changing separator

Rewrite C<CSV> files with C<;> as separator character to well-formed C<CSV>:

 use Text::CSV_XS qw( csv );
 csv (in => csv (in => "bad.csv", sep_char => ";"), out => *STDOUT);

As C<STDOUT> is now default in L</csv>, a one-liner converting a UTF-16 CSV
file with BOM and TAB-separation to valid UTF-8 CSV could be:

 $ perl -C3 -MText::CSV_XS=csv -we\
    'csv(in=>"utf16tab.csv",encoding=>"utf16",sep=>"\t")' >utf8.csv

=head3 Unifying EOL

Rewrite a CSV file with mixed EOL  and/or inconsistent quotation into a new
CSV file with consistent EOL and quotation. Attributes apply.

 use Text::CSV_XS qw( csv );
 csv (in => "file.csv", out => "newfile.csv", quote_space => 1);

=head2 Dumping database tables to CSV

Dumping a database table can be simple as this (TIMTOWTDI):

 my $dbh = DBI->connect (...);
 my $sql = "select * from foo";

 # using your own loop
 open my $fh, ">", "foo.csv" or die "foo.csv: $!\n";
 my $csv = Text::CSV_XS->new ({ binary => 1, eol => "\r\n" });
 my $sth = $dbh->prepare ($sql); $sth->execute;
 $csv->print ($fh, $sth->{NAME_lc});
 while (my $row = $sth->fetch) {
     $csv->print ($fh, $row);
     }

 # using the csv function, all in memory
 csv (out => "foo.csv", in => $dbh->selectall_arrayref ($sql));

 # using the csv function, streaming with callbacks
 my $sth = $dbh->prepare ($sql); $sth->execute;
 csv (out => "foo.csv", in => sub { $sth->fetch            });
 csv (out => "foo.csv", in => sub { $sth->fetchrow_hashref });

Note that this does not discriminate between "empty" values and NULL-values
from the database,  as both will be the same empty field in CSV.  To enable
distinction between the two, use L<C<quote_empty>|/quote_empty>.

 csv (out => "foo.csv", in => sub { $sth->fetch }, quote_empty => 1);

If the database import utility supports special sequences to insert C<NULL>
values into the database,  like MySQL/MariaDB supports C<\N>,  use a filter
or a map

 csv (out => "foo.csv", in => sub { $sth->fetch },
                     on_in => sub { $_ //= "\\N" for @{$_[1]} });

 while (my $row = $sth->fetch) {
     $csv->print ($fh, [ map { $_ // "\\N" } @$row ]);
     }

Note that this will not work as expected when choosing the backslash (C<\>)
as C<escape_char>, as that will cause the C<\> to need to be escaped by yet
another C<\>,  which will cause the field to need quotation and thus ending
up as C<"\\N"> instead of C<\N>. See also L<C<undef_str>|/undef_str>.

 csv (out => "foo.csv", in => sub { $sth->fetch }, undef_str => "\\N");

CSV_XS.pm view on Meta::CPAN

=over 2

=item parser-xs.pl
X<parser-xs.pl>

This can be used as a boilerplate to parse invalid C<CSV>  and parse beyond
(expected) errors alternative to using the L</error> callback.

 $ perl examples/parser-xs.pl bad.csv >good.csv

=item csv-check
X<csv-check>

This is a command-line tool that uses parser-xs.pl  techniques to check the
C<CSV> file and report on its content.

 $ csv-check files/utf8.csv
 Checked files/utf8.csv  with csv-check 1.9
 using Text::CSV_XS 1.32 with perl 5.26.0 and Unicode 9.0.0
 OK: rows: 1, columns: 2
     sep = <,>, quo = <">, bin = <1>, eol = <"\n">

=item csv-split
X<csv-split>

This command splits C<CSV> files into smaller files,  keeping (part of) the
header.  Options include maximum number of (data) rows per file and maximum
number of columns per file or a combination of the two.

=item csv2xls
X<csv2xls>

A script to convert C<CSV> to Microsoft Excel (C<XLS>). This requires extra
modules L<Date::Calc> and L<Spreadsheet::WriteExcel>. The converter accepts
various options and can produce UTF-8 compliant Excel files.

=item csv2xlsx
X<csv2xlsx>

A script to convert C<CSV> to Microsoft Excel (C<XLSX>).  This requires the
modules L<Date::Calc> and L<Spreadsheet::Writer::XLSX>.  The converter does
accept various options including merging several C<CSV> files into a single
Excel file.

=item csvdiff
X<csvdiff>

A script that provides colorized diff on sorted CSV files,  assuming  first
line is header and first field is the key. Output options include colorized
ANSI escape codes or HTML.

 $ csvdiff --html --output=diff.html file1.csv file2.csv

=item rewrite.pl
X<rewrite.pl>

A script to rewrite (in)valid CSV into valid CSV files.  Script has options
to generate confusing CSV files or CSV files that conform to Dutch MS-Excel
exports (using C<;> as separation).

Script - by default - honors BOM  and auto-detects separation converting it
to default standard CSV with C<,> as separator.

=back

=head1 CAVEATS

Text::CSV_XS  is I<not> designed to detect the characters used to quote and
separate fields.  The parsing is done using predefined  (default) settings.
In the examples  sub-directory,  you can find scripts  that demonstrate how
you could try to detect these characters yourself.

=head2 Microsoft Excel

The import/export from Microsoft Excel is a I<risky task>, according to the
documentation in C<Text::CSV::Separator>.  Microsoft uses the system's list
separator defined in the regional settings, which happens to be a semicolon
for Dutch, German and Spanish (and probably some others as well).   For the
English locale,  the default is a comma.   In Windows however,  the user is
free to choose a  predefined locale,  and then change  I<every>  individual
setting in it, so checking the locale is no solution.

As of version 1.17, a lone first line with just

  sep=;

will be recognized and honored when parsing with L</getline>.

=head1 TODO

=over 2

=item More Errors & Warnings

New extensions ought to be  clear and concise  in reporting what  error has
occurred where and why, and maybe also offer a remedy to the problem.

L</error_diag> is a (very) good start, but there is more work to be done in
this area.

Basic calls  should croak or warn on  illegal parameters.  Errors should be
documented.

=item setting meta info

Future extensions might include extending the L</meta_info>, L</is_quoted>,
and  L</is_binary>  to accept setting these  flags for  fields,  so you can
specify which fields are quoted in the L</combine>/L</string> combination.

 $csv->meta_info (0, 1, 1, 3, 0, 0);
 $csv->is_quoted (3, 1);

L<Metadata Vocabulary for Tabular Data|http://w3c.github.io/csvw/metadata/>
(a W3C editor's draft) could be an example for supporting more metadata.

=item Parse the whole file at once

Implement new methods or functions  that enable parsing of a  complete file
at once, returning a list of hashes. Possible extension to this could be to
enable a column selection on the call:

( run in 0.872 second using v1.01-cache-2.11-cpan-39bf76dae61 )