Text-CSV_XS
The old(er) way of using global file handles is still supported
while (my $row = $csv->getline (*ARGV)) { ... }
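With lexical filehandles, the equivalent reads as follows (a minimal sketch; the file name and attributes are illustrative, not prescribed by the module):

```perl
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });

# A lexical filehandle stays scoped to the enclosing block
open my $fh, "<", "file.csv" or die "file.csv: $!";
while (my $row = $csv->getline ($fh)) {
    # $row is an array ref holding the fields of one record
    print "@$row\n";
    }
close $fh;
```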
=head2 Unicode
Unicode is only tested to work with perl-5.8.2 and up.
See also L</BOM>.
The simplest way to ensure the correct encoding is used for in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.
open my $fh, "<:encoding(UTF-8)", "in.csv" or die "in.csv: $!";
or
my $aoa = csv (in => "in.csv", encoding => "UTF-8");
open my $fh, ">:encoding(UTF-8)", "out.csv" or die "out.csv: $!";
or
csv (in => $aoa, out => "out.csv", encoding => "UTF-8");
$csv = Text::CSV::Encoded->new ({ encoding => "utf8" });
# combine () and print () accept *literally* utf8 encoded data
# parse () and getline () return *literally* utf8 encoded data
$csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
# combine () and print () accept UTF8 marked data
# parse () and getline () return UTF8 marked data
=head2 BOM
BOM (or Byte Order Mark) handling is available only inside the L</header>
method. This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.
If a file has a BOM, the easiest way to deal with that is
my $aoh = csv (in => $file, detect_bom => 1);
All records will be decoded based on the detected BOM.
This implies a call to the L</header> method, which by default also sets
L</column_names>. So this is B<not> the same as
my $aoh = csv (in => $file, headers => "auto");
which only reads the first record to set L</column_names>, but ignores any
BOM that may be present.
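To make the contrast concrete, a sketch (assuming a hypothetical UTF-16LE file C<data.csv> that starts with a BOM):

```perl
use Text::CSV_XS qw( csv );

# detect_bom: reads the BOM, sets the matching encoding on the
# stream, and calls header () - which also sets column_names
my $aoh_bom  = csv (in => "data.csv", detect_bom => 1);

# headers => "auto": takes the first record as column names, but
# does not inspect the BOM, so the stream is not decoded and the
# BOM bytes stick to the first column name
my $aoh_auto = csv (in => "data.csv", headers => "auto");
```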
=head1 SPECIFICATION
While no formal specification for CSV exists, L<RFC 4180|https://datatracker.ietf.org/doc/html/rfc4180>
(I<1>) describes the common format and establishes C<text/csv> as the MIME
type registered with the IANA. L<RFC 7111|https://datatracker.ietf.org/doc/html/rfc7111>
(I<2>) adds fragments to CSV.
Many informal documents exist that describe the C<CSV> format. L<"How To:
The Comma Separated Value (CSV) File Format"|http://creativyst.com/Doc/Articles/CSV/CSV01.shtml>
(I<3>) provides an overview of the C<CSV> format in the most widely used
applications and explains how it can best be used and supported.
$csv->header ($fh, [ ";", ",", "|", "\t", "::", "\x{2063}" ]);
Multi-byte sequences are allowed, both multi-character and Unicode. See
L<C<sep>|/sep>.
=item detect_bom
X<detect_bom>
$csv->header ($fh, { detect_bom => 1 });
The default behavior is to detect if the header line starts with a BOM. If
the header has a BOM, use that to set the encoding of C<$fh>. This default
behavior can be disabled by passing a false value to C<detect_bom>.
Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1, and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.
If a supported BOM was detected as start of the stream, it is stored in the
object attribute C<ENCODING>.
my $enc = $csv->{ENCODING};
The encoding is used with C<binmode> on C<$fh>.
If the handle was opened in a (correct) encoding, this method will B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.
=item munge_column_names
X<munge_column_names>
This option offers the means to modify the column names into something that
is most useful to the application. The default is to map all column names
to lower case.
$csv->header ($fh, { munge_column_names => "lc" });
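Besides the default C<"lc">, the option also accepts C<"uc">, C<"none">, a hashref mapping old names to new ones, or a coderef that gets each header name in C<$_> (a sketch continuing the example above; verify the coderef calling convention against your installed version):

```perl
$csv->header ($fh, { munge_column_names => "uc"   });   # upper case
$csv->header ($fh, { munge_column_names => "none" });   # keep as-is

# fold to lower case and squash non-word runs to "_"
$csv->header ($fh, { munge_column_names => sub { lc (s/\W+/_/gr) } });
```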
=head3 encoding
X<encoding>
If passed, it should be an encoding accepted by the C<:encoding()> option
to C<open>. There is no default value. This attribute does not work in perl
5.6.x. C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.
If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly. This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.
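In other words, these two calls behave identically (a sketch; C<file.csv> is illustrative):

```perl
use Text::CSV_XS qw( csv );

my $aoh1 = csv (in => "file.csv", encoding   => "auto");
my $aoh2 = csv (in => "file.csv", detect_bom => 1);
```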
Encodings can be stacked, as supported by C<binmode>:
# Using PerlIO::via::gzip
csv (in       => \@csv,
     out      => "test.csv:via.gz",
     encoding => ":via(gzip):encoding(utf-8)",
     );
$aoa = csv (in => "test.csv:via.gz", encoding => ":via(gzip)");

# Using PerlIO::gzip
csv (in       => \@csv,
     out      => "test.csv:gzip.gz",
     encoding => ":gzip:encoding(utf-8)",
     );
$aoa = csv (in => "test.csv:gzip.gz", encoding => ":gzip");
=head3 detect_bom
X<detect_bom>
If C<detect_bom> is given, the method L</header> will be invoked on the
opened stream to check if there is a BOM and set the encoding accordingly.
Note that the attribute L<C<headers>|/headers> can be used to overrule the
default behavior of how that method automatically sets the attribute.
C<detect_bom> can be abbreviated to C<bom>.
This is the same as setting L<C<encoding>|/encoding> to C<"auto">.
=head3 headers
X<headers>
=head2 Rewriting CSV
=head3 Changing separator
Rewrite C<CSV> files with C<;> as separator character to well-formed C<CSV>:
use Text::CSV_XS qw( csv );
csv (in => csv (in => "bad.csv", sep_char => ";"), out => *STDOUT);
As C<STDOUT> is now default in L</csv>, a one-liner converting a UTF-16 CSV
file with BOM and TAB-separation to valid UTF-8 CSV could be:
$ perl -C3 -MText::CSV_XS=csv -we\
'csv(in=>"utf16tab.csv",encoding=>"utf16",sep=>"\t")' >utf8.csv
=head3 Unifying EOL
Rewrite a CSV file with mixed EOL and/or inconsistent quotation into a new
CSV file with consistent EOL and quotation. Attributes apply.
use Text::CSV_XS qw( csv );
$ csvdiff --html --output=diff.html file1.csv file2.csv
=item rewrite.pl
X<rewrite.pl>
A script to rewrite (in)valid CSV into valid CSV files. The script has
options to generate confusing CSV files or CSV files that conform to Dutch
MS-Excel exports (using C<;> as separator).
By default the script honors a BOM and auto-detects the separator,
converting the input to standard CSV with C<,> as separator.
=back
=head1 CAVEATS
Text::CSV_XS is I<not> designed to detect the characters used to quote and
separate fields. The parsing is done using predefined (default) settings.
In the examples sub-directory, you can find scripts that demonstrate how
you could try to detect these characters yourself.
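As an illustration only (this helper is not part of Text::CSV_XS), one naive detection heuristic counts candidate separator characters in a few sample lines and picks the most frequent; note that quoted fields containing those characters will skew the counts:

```perl
use strict;
use warnings;

# Guess the separator by counting candidate characters in the
# given sample lines; returns the most frequent candidate.
sub guess_sep {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        $count{$_} += () = $line =~ m/\Q$_\E/g for (",", ";", "\t", "|");
        }
    my ($sep) = sort { $count{$b} <=> $count{$a} } keys %count;
    return $sep;
    }

print guess_sep ("a;b;c", "1;2,5;3"), "\n";   # prints ";"
```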
* Make detect_bom result available
* It's 2018
* Add csv (out => \"skip") - suppress output deliberately
* Allow sub as top-level filter
* Tested against Test2::Harness-0.001062 (yath test)
* Tested against perl-5.27.10
1.34 - 2017-11-05, H.Merijn Brand
* Bad arg for formula (like "craok") will now die with error 1500
* Row report in formula reporting was off by 1
* Add a prominent section about BOM handling
* Make sheet label more portable (csv2xlsx)
* Allow munge => \%hash
* Preserve first row in csv (set_column_names => 0)
1.33 - 2017-10-19, H.Merijn Brand
* Small additional fix for eol = \r + BOM
* Updated doc for example files
* Add support for formula actions (issue 11)
- csv2xls and csv2xlsx now warn by default
* Reset file info on ->header call (RT#123320)
1.32 - 2017-09-15, H.Merijn Brand
* Add keep_headers attribute to csv ()
* Fix on_in when used in combination with key
* Fail on invalid arguments to csv
* Fix header method on EOL = CR (RT#122764)
1.31 - 2017-06-13, H.Merijn Brand
* Fix already decoded BOM in headers
* New options in csv-check
* Some perlcritic
* "escape" is alias for "escape_char" for consistency.
* Code cleanup and more tests (Devel::Cover)
* Improve csv-check auto-sep-detection
1.30 - 2017-06-08, H.Merijn Brand
* Add csv (..., out => ...) syntax examples (issue 7)
* Disable escape_null for undefined escape_char
* Fix ->say for bound columns (RT#121576)
examples/csv-check
the string "tab" is allowed.
-q Q --quo=Q use Q as quote char. Auto-detect, default = '"'
the string "undef" will disable quotation.
-e E --esc=E use E as escape char. Auto-detect, default = '"'
the string "undef" will disable escapes.
-N --nl force EOL to \\n
-C --cr force EOL to \\r
-M --crnl force EOL to \\r\\n
-u --utf-8 check if all fields are valid unicode
-E E --enc=E open file with encoding E
-h --hdr check with header (implies BOM)
-b --bom check with BOM (no header)
-f --skip-formula do not check formulas
--pp use Text::CSV_PP instead (cross-check)
-A a --attr=at:val pass attributes to parser
--at=val is also supported for known attributes
-L --list-attr list supported CSV attributes
-X --list-changes list attributes that changed from default
EOU
exit $err;
examples/rewrite.pl
sub usage {
my $err = shift and select STDERR;
print <<"EOH";
usage: $0 [-o file] [-s S] [-m] [-c] [-i] [file]
-o F --out=F output to file F (default STDOUT)
-s S --sep=S set input separator to S (default ; , TAB or |)
-m --ms output Dutch style MicroSoft CSV (; and \\r\\n)
-n --no-header CSV has no header line. If selected
- default input sep = ;
- BOM is not used/recognized
-c --confuse Use confusing separation and quoting characters
-i --invisible Use invisible separation and quotation sequences
EOH
exit $err;
} # usage
use Getopt::Long qw(:config bundling);
GetOptions (
"help|?" => sub { usage (0); },
"s|sep=s" => \my $in_sep,
BmPREVIOUS|5.003007||Viu
BmRARE|5.003007||Viu
BmUSEFUL|5.003007||Viu
BOL|5.003007||Viu
BOL_t8|5.035004||Viu
BOL_t8_p8|5.033003||Viu
BOL_t8_pb|5.033003||Viu
BOL_tb|5.035004||Viu
BOL_tb_p8|5.033003||Viu
BOL_tb_pb|5.033003||Viu
BOM_UTF8|5.025005|5.003007|p
BOM_UTF8_FIRST_BYTE|5.019004||Viu
BOM_UTF8_TAIL|5.019004||Viu
boolSV|5.004000|5.003007|p
boot_core_builtin|5.035007||Viu
boot_core_mro|5.009005||Viu
boot_core_PerlIO|5.007002||Viu
boot_core_UNIVERSAL|5.003007||Viu
BOUND|5.003007||Viu
BOUNDA|5.013009||Viu
BOUNDA_t8|5.035004||Viu
BOUNDA_t8_p8|5.033003||Viu
BOUNDA_t8_pb|5.033003||Viu
#ifndef isUTF8_CHAR
# define isUTF8_CHAR(s, e) ( \
(e) <= (s) || ! is_utf8_string(s, UTF8_SAFE_SKIP(s, e)) \
? 0 \
: UTF8SKIP(s))
#endif
#endif
#if 'A' == 65
#ifndef BOM_UTF8
# define BOM_UTF8 "\xEF\xBB\xBF"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
# define REPLACEMENT_CHARACTER_UTF8 "\xEF\xBF\xBD"
#endif
#elif '^' == 95
#ifndef BOM_UTF8
# define BOM_UTF8 "\xDD\x73\x66\x73"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
# define REPLACEMENT_CHARACTER_UTF8 "\xDD\x73\x73\x71"
#endif
#elif '^' == 176
#ifndef BOM_UTF8
# define BOM_UTF8 "\xDD\x72\x65\x72"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
# define REPLACEMENT_CHARACTER_UTF8 "\xDD\x72\x72\x70"
#endif
#else
# error Unknown character set
#endif
t/85_util.t
#$ebcdic and $has_enc = 0; # TODO
$csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 9 });
SKIP: {
$has_enc or skip "Encoding $enc not supported", $enc =~ m/^utf/ ? 10 : 9;
$csv->column_names (undef);
open my $fh, "<", $fnm;
binmode $fh;
ok (1, "$fnm opened for enc $enc");
ok ($csv->header ($fh), "headers with BOM for $enc");
$enc =~ m/^utf/ and is ($csv->{ENCODING}, uc $enc, "Encoding inquirable");
is (($csv->column_names)[1], "b${a_ring}r", "column name was decoded");
ok (my $row = $csv->getline_hr ($fh), "getline_hr");
is ($row->{"b${a_ring}r"}, "1 \x{20ac} each", "Returned in Unicode");
close $fh;
my $aoh;
ok ($aoh = csv (in => $fnm, bom => 1), "csv (bom => 1)");
is_deeply ($aoh,
[{ zoo => 1, "b${a_ring}r" => "1 \x{20ac} each" }], "Returned data auto");
}
SKIP: {
$has_enc or skip "Encoding $enc not supported", 7;
$csv->column_names (undef);
open my $fh, "<", $fnm;
$enc eq "none" or binmode $fh, ":encoding($enc)";
ok (1, "$fnm opened for enc $enc");
ok ($csv->header ($fh), "headers with BOM for $enc");
is (($csv->column_names)[1], "b${a_ring}r", "column name was decoded");
ok (my $row = $csv->getline_hr ($fh), "getline_hr");
is ($row->{"b${a_ring}r"}, "1 \x{20ac} each", "Returned in Unicode");
close $fh;
ok (my $aoh = csv (in => $fnm, bom => 1), "csv (bom => 1)");
is_deeply ($aoh,
[{ zoo => 1, "b${a_ring}r" => "1 \x{20ac} each" }], "Returned data");
}
t/91_csv_cb.t
[ { FOO => 1, BAR => 2, BAZ => 3 },
{ FOO => 2, BAR => "a b", BAZ => "" }],
"AOH with lc headers");
is_deeply (csv (in => $tfn, headers => sub { lcfirst uc $_[0] }),
[ { fOO => 1, bAR => 2, bAZ => 3 },
{ fOO => 2, bAR => "a b", bAZ => "" }],
"AOH with mangled headers");
}
SKIP: {
$] < 5.008001 and skip "No BOM support in $]", 1;
is_deeply (csv (in => $tfn, munge => { bar => "boo" }),
[{ baz => 3, boo => 2, foo => 1 },
{ baz => "", boo => "a b", foo => 2 }], "Munge with hash");
}
open $fh, ">>", $tfn or die "$tfn: $!";
print $fh <<"EOD";
3,3,3
4,5,6
5,7,9
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, # 8
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, # 9
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, # A
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, # B
-1,-1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, # C
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, # D
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, # E
4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7,13, # F
) : ();
# Used for BOM testing
*byte_utf8a_to_utf8n = $ebcdic ? sub {
# Convert a UTF-8 byte sequence into the platform's native UTF-8
# equivalent, currently only UTF-8 and UTF-EBCDIC.
my $string = shift;
utf8::is_utf8 ($string) and return $string;
my $length = length $string;
#diag ($string);
#diag ($length);