Text-CSV
lib/Text/CSV_PP.pm view on Meta::CPAN
$row->[2] =~ m/pattern/ or next; # 3rd field should match
push @rows, $row;
}
close $fh;
# and write as CSV
open $fh, ">:encoding(utf8)", "new.csv" or die "new.csv: $!";
$csv->say ($fh, $_) for @rows;
close $fh or die "new.csv: $!";
=head1 DESCRIPTION
Text::CSV_PP is a pure-perl module that provides facilities for the
composition and decomposition of comma-separated values. It is
(almost) compatible with the much faster L<Text::CSV_XS>, and is mainly
used as the fallback module when you use the L<Text::CSV> module without
having installed Text::CSV_XS. Unless you have a reason to use
this module directly, use Text::CSV for the speed boost and portability
(or maybe Text::CSV_XS when you write a one-off script and don't need
to care about portability).
The following caveats are taken from the doc of Text::CSV_XS.
=head2 Embedded newlines
B<Important Note>: The default behavior is to accept only ASCII characters
in the range from C<0x20> (space) to C<0x7E> (tilde). This means that the
fields can not contain newlines. If your data contains newlines embedded in
fields, or characters above C<0x7E> (tilde), or binary data, you B<I<must>>
set C<< binary => 1 >> in the call to L</new>. To cover the widest range of
parsing options, you will always want to set binary.
But you are still responsible for passing a complete record to the
L</parse> method, which is harder than it looks in typical usage:
my $csv = Text::CSV_PP->new ({ binary => 1, eol => $/ });
while (<>) { # WRONG!
$csv->parse ($_);
my @fields = $csv->fields ();
}
This will break, as the C<while> might read broken lines: it does not care
about the quoting. If you need to support embedded newlines, the way to go
is to B<not> pass L<C<eol>|/eol> in the parser (it accepts C<\n>, C<\r>,
B<and> C<\r\n> by default) and then
my $csv = Text::CSV_PP->new ({ binary => 1 });
open my $fh, "<", $file or die "$file: $!";
while (my $row = $csv->getline ($fh)) {
my @fields = @$row;
}
The old(er) way of using global file handles is still supported:
while (my $row = $csv->getline (*ARGV)) { ... }
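As a minimal sketch of the approach recommended above, the following reads a record whose second field contains an embedded newline. The sample data and the in-memory file handle are assumptions made for illustration; any readable handle should behave the same way.

```perl
use strict;
use warnings;
use Text::CSV_PP;

# Hypothetical sample data: the second record spans two physical lines
my $data = qq{id,note\n1,"line one\nline two"\n2,plain\n};
open my $fh, "<", \$data or die "in-memory handle: $!";

# No eol passed: the parser accepts \n, \r, and \r\n by default
my $csv = Text::CSV_PP->new ({ binary => 1, auto_diag => 1 });

my @rows;
while (my $row = $csv->getline ($fh)) {
    push @rows, $row;
}
close $fh;

# getline honours the quoting, so the embedded newline stays inside the field
printf "%d records\n", scalar @rows;
```

A plain C<< while (<$fh>) >> loop over the same data would see four physical lines instead of three records.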
=head2 Unicode
Unicode is only tested to work with perl-5.8.2 and up.
See also L</BOM>.
The simplest way to ensure the correct encoding is used for in- and output
is by either setting layers on the filehandles, or setting the L</encoding>
argument for L</csv>.
open my $fh, "<:encoding(UTF-8)", "in.csv" or die "in.csv: $!";
or
my $aoa = csv (in => "in.csv", encoding => "UTF-8");
open my $fh, ">:encoding(UTF-8)", "out.csv" or die "out.csv: $!";
or
csv (in => $aoa, out => "out.csv", encoding => "UTF-8");
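A short round trip can illustrate the L</encoding> argument. This is only a sketch: the sample data and the temporary file are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use Text::CSV_PP qw( csv );

# Hypothetical sample data containing characters above 0x7E
my $aoa = [
    [ "id", "name"          ],
    [ 1,    "Zo\x{eb}"      ],
    [ 2,    "\x{20ac} sign" ],
    ];

# Temporary file used only for this illustration
my ($tmp, $file) = tempfile (SUFFIX => ".csv", UNLINK => 1);
close $tmp;

# Write and read back using the same encoding on both sides
csv (in => $aoa, out => $file, encoding => "UTF-8");
my $back = csv (in => $file, encoding => "UTF-8");

print scalar @$back, " records read back\n";
```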
On parsing (both for L</getline> and L</parse>), if the source is marked
as being UTF8, then all fields that are marked binary will also be marked UTF8.
On combining (L</print> and L</combine>): if any of the combining fields
was marked UTF8, the resulting string will be marked as UTF8. Note however
that all fields I<before> the first field marked UTF8 that contained 8-bit
characters and were not upgraded to UTF8 will be C<bytes> in the
resulting string too, possibly causing unexpected errors. If you pass data
of different encodings, or you don't know whether the encodings differ,
force the data to be upgraded before you pass it on:
$csv->print ($fh, [ map { utf8::upgrade (my $x = $_); $x } @data ]);
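The effect of that C<map> can be checked without any CSV handling at all. The sketch below uses only core Perl; the two sample strings are assumptions chosen so that one is stored as bytes and the other is already marked UTF8.

```perl
use strict;
use warnings;

# Hypothetical mixed input: the first string is stored as bytes (Latin-1),
# the second is already flagged as UTF8 because it holds a wide character.
my @data = ("caf\xe9", "\x{20ac}5");

# Same idiom as above: upgrade a copy, leave the original untouched
my @upgraded = map { utf8::upgrade (my $x = $_); $x } @data;

for my $i (0 .. $#data) {
    printf "field %d: flag before=%d, after=%d, same text=%s\n", $i,
        utf8::is_utf8 ($data[$i])     ? 1 : 0,
        utf8::is_utf8 ($upgraded[$i]) ? 1 : 0,
        $data[$i] eq $upgraded[$i]    ? "yes" : "no";
}
```

Upgrading changes only the internal representation: the upgraded copies still compare equal to the originals.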
For complete control over encoding, please use L<Text::CSV::Encoded>:
use Text::CSV::Encoded;
my $csv = Text::CSV::Encoded->new ({
encoding_in => "iso-8859-1", # the encoding comes into Perl
encoding_out => "cp1252", # the encoding comes out of Perl
});
$csv = Text::CSV::Encoded->new ({ encoding => "utf8" });
# combine () and print () accept *literally* utf8 encoded data
# parse () and getline () return *literally* utf8 encoded data
$csv = Text::CSV::Encoded->new ({ encoding => undef }); # default
# combine () and print () accept UTF8 marked data
# parse () and getline () return UTF8 marked data
=head2 BOM
BOM (or Byte Order Mark) handling is available only inside the L</header>
method. This method supports the following encodings: C<utf-8>, C<utf-1>,
C<utf-32be>, C<utf-32le>, C<utf-16be>, C<utf-16le>, C<utf-ebcdic>, C<scsu>,
C<bocu-1>, and C<gb-18030>. See L<Wikipedia|https://en.wikipedia.org/wiki/Byte_order_mark>.
If a file has a BOM, the easiest way to deal with that is
my $aoh = csv (in => $file, detect_bom => 1);
All records will be encoded based on the detected BOM.
This implies a call to the L</header> method, which defaults to also set
the L</column_names>. So this is B<not> the same as
my $aoh = csv (in => $file, headers => "auto");
which only reads the first record to set L</column_names> but ignores the
meaning of any BOM that may be present.
=head1 METHODS
This section is also taken from Text::CSV_XS.
=head2 version
(Class method) Returns the current module version.
=head2 new
(Class method) Returns a new instance of class Text::CSV_PP. The attributes
are described by the (optional) hash ref C<\%attr>.
my $csv = Text::CSV_PP->new ({ attributes ... });
The following attributes are available:
=head3 eol
my $csv = Text::CSV_PP->new ({ eol => $/ });
$csv->eol (undef);
my $eol = $csv->eol;
The end-of-line string to add to rows for L</print> or the record separator
for L</getline>.
When not passed in a B<parser> instance, the default behavior is to accept
C<\n>, C<\r>, and C<\r\n>, so it is probably safer to not specify C<eol> at
all. Passing C<undef> and passing the empty string behave the same.
When not passed in a B<generating> instance, records are not terminated at
all, so it is probably wise to pass something you expect. A safe choice for
C<eol> on output is either C<$/> or C<\r\n>.
Common values for C<eol> are C<"\012"> (C<\n> or Line Feed), C<"\015\012">
(C<\r\n> or Carriage Return, Line Feed), and C<"\015"> (C<\r> or Carriage
Return). The L<C<eol>|/eol> attribute cannot exceed 7 (ASCII) characters.
If both C<$/> and L<C<eol>|/eol> equal C<"\015">, lines that end in
only a Carriage Return without a Line Feed will be L</parse>d correctly.
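For the generating side, a small sketch shows C<eol> being appended to every record. The in-memory output buffer and the sample fields are assumptions for illustration only.

```perl
use strict;
use warnings;
use IO::Handle;
use Text::CSV_PP;

# Generating instance: every printed record is terminated with CR LF
my $csv = Text::CSV_PP->new ({ binary => 1, eol => "\r\n" });

my $buffer = "";
open my $out, ">", \$buffer or die "in-memory handle: $!";
$csv->print ($out, [ 1, "hello, world", "plain" ]) or die "print failed";
$csv->print ($out, [ 2, "second", "row" ])         or die "print failed";
close $out or die "close: $!";

# Make the line endings visible before printing them
(my $visible = $buffer) =~ s/\r/\\r/g;
$visible =~ s/\n/\\n/g;
print "$visible\n";
```

Without C<eol>, the two records would be written back to back with no terminator between them.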
=head3 eol_type
my $eol = $csv->eol_type;
This read-only method returns the internal state of what is considered the
valid EOL for parsing.
=head3 sep_char
my $csv = Text::CSV_PP->new ({ sep_char => ";" });
$csv->sep_char (";");
my $c = $csv->sep_char;
The char used to separate fields, by default a comma (C<,>). Limited to a
single-byte character, usually in the range from C<0x20> (space) to C<0x7E>
(tilde). When longer sequences are required, use L<C<sep>|/sep>.
The separation character can not be equal to the quote character or to the
Assuming that the file opened for parsing has a header, and the header does
not contain problematic characters like embedded newlines, read the first
line from the open handle then auto-detect whether the header separates the
column names with a character from the allowed separator list.
If any of the allowed separators matches, and none of the I<other> allowed
separators match, set L<C<sep>|/sep> to that separator for the current
CSV_PP instance and use it to parse the first line, map the resulting
fields to lowercase, and use them to set the instance L</column_names>:
my $csv = Text::CSV_PP->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<", "file.csv" or die "file.csv: $!";
binmode $fh; # for Windows
$csv->header ($fh);
while (my $row = $csv->getline_hr ($fh)) {
...
}
If the header is empty, contains more than one unique separator out of the
allowed set, contains empty fields, or contains identical fields (after
folding), it will croak with error 1010, 1011, 1012, or 1013 respectively.
If the header contains embedded newlines or is not valid CSV in any other
way, this method will croak and leave the parse error untouched.
A successful call to C<header> will always set the L<C<sep>|/sep> of the
C<$csv> object. This behavior can not be disabled.
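The detection described above can be tried on a small file. The following is a sketch under stated assumptions: the temporary file, its contents, and the choice of C<;> as separator are all made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use Text::CSV_PP;

# Hypothetical sample file using ";" as separator and mixed-case headers
my ($tmp, $file) = tempfile (SUFFIX => ".csv", UNLINK => 1);
print {$tmp} "Name;City\nAlice;Utrecht\nBob;Leiden\n";
close $tmp or die "$file: $!";

my $csv = Text::CSV_PP->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<", $file or die "$file: $!";
binmode $fh;    # for Windows

# In list context the (lower-cased) header fields are returned
my @hdr = $csv->header ($fh);

my @rows;
while (my $row = $csv->getline_hr ($fh)) {
    push @rows, $row;
}
close $fh;

print "separator: ", $csv->sep_char, "\n";
print "columns:   @hdr\n";
print "$_->{name} lives in $_->{city}\n" for @rows;
```

Because the header call also sets the column names, C<getline_hr> can be used right away, keyed by the lower-cased names.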
=head3 return value
On error this method will croak.
In list context, the headers will be returned whether they are used to set
L</column_names> or not.
In scalar context, the instance itself is returned. B<Note>: the values as
found in the header will effectively be B<lost> if C<set_column_names> is
false.
=head3 Options
=over 2
=item sep_set
$csv->header ($fh, { sep_set => [ ";", ",", "|", "\t" ] });
The list of legal separators defaults to C<[ ";", "," ]> and can be changed
by this option. As this is probably the most often used option, it can be
passed on its own as an unnamed argument:
$csv->header ($fh, [ ";", ",", "|", "\t", "::", "\x{2063}" ]);
Multi-byte sequences are allowed, both multi-character and Unicode. See
L<C<sep>|/sep>.
=item detect_bom
$csv->header ($fh, { detect_bom => 1 });
The default behavior is to detect if the header line starts with a BOM. If
the header has a BOM, use that to set the encoding of C<$fh>. This default
behavior can be disabled by passing a false value to C<detect_bom>.
Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and
UTF-32LE. BOM also supports UTF-1, UTF-EBCDIC, SCSU, BOCU-1, and GB-18030
but L<Encode> does not (yet). UTF-7 is not supported.
If a supported BOM was detected at the start of the stream, its encoding is
stored in the object attribute C<ENCODING>.
my $enc = $csv->{ENCODING};
The encoding is used with C<binmode> on C<$fh>.
If the handle was opened in a (correct) encoding, this method will B<not>
alter the encoding, as it checks the leading B<bytes> of the first line. In
case the stream starts with a decoded BOM (C<U+FEFF>), C<{ENCODING}> will be
C<""> (empty) instead of the default C<undef>.
=item munge_column_names
This option offers the means to modify the column names into something that
is most useful to the application. The default is to map all column names
to lower case.
$csv->header ($fh, { munge_column_names => "lc" });
The following values are available:
lc - lower case
uc - upper case
db - valid DB field names
none - do not change
\%hash - supply a mapping
\&cb - supply a callback
=over 2
=item Lower case
$csv->header ($fh, { munge_column_names => "lc" });
The header is changed to all lower-case
$_ = lc;
=item Upper case
$csv->header ($fh, { munge_column_names => "uc" });
The header is changed to all upper-case
$_ = uc;
=item Literal
$csv->header ($fh, { munge_column_names => "none" });
=item Hash
$csv->header ($fh, { munge_column_names => { foo => "sombrero" } });
If a column name has no entry in the hash, the original name is used unchanged.
=item Database
$csv->header ($fh, { munge_column_names => "db" });
=over 2
=item -
lower-case
=item -
all sequences of non-word characters are replaced with an underscore
a false value.
If C<out> is set to a reference of the literal string C<"skip">, the output
will be suppressed completely, which might be useful in combination with a
filter for side effects only.
my %cache;
csv (in => "dump.csv",
out => \"skip",
on_in => sub { $cache{$_[1][1]}++ });
Currently, setting C<out> to any false value (C<undef>, C<"">, 0) will be
equivalent to C<\"skip">.
If the C<in> argument points to something to parse, and C<out> is set to
a reference to an C<ARRAY> or a C<HASH>, the output is appended to the data
in the existing reference. The result of the parse should match what exists
in the reference passed. This might come in handy when you have to parse a set
of files with similar content (like data stored per period) and you want to
collect that into a single data structure:
my %hash;
csv (in => $_, out => \%hash, key => "id") for sort glob "foo-[0-9]*.csv";
my @list; # List of arrays
csv (in => $_, out => \@list) for sort glob "foo-[0-9]*.csv";
my @list; # List of hashes
csv (in => $_, out => \@list, bom => 1) for sort glob "foo-[0-9]*.csv";
=head4 Streaming
If B<both> C<in> and C<out> are files, file handles or globs, streaming is
enforced by injecting an C<after_parse> callback that immediately uses the
L<C<say ()>|/say> method of the same instance to output the result and then
rejects the record.
If an C<after_parse> was already passed as attribute, that will be included
in the injected call. If C<on_in> was passed and C<after_parse> was not, it
will be used instead. If both were passed, C<on_in> is ignored.
The EOL of the first record of the C<in> source is consistently used as EOL
for all records in the C<out> destination.
The C<filter> attribute is not available.
All other attributes are shared for C<in> and C<out>, so you cannot define
different encodings for C<in> and C<out>. You need to pass a C<$fh>, where
C<binmode> was used to apply the encoding layers.
Note that this is work in progress and things might change.
=head3 encoding
If passed, it should be an encoding accepted by the C<:encoding()> option
to C<open>. There is no default value. This attribute does not work in perl
5.6.x. C<encoding> can be abbreviated to C<enc> for ease of use in command
line invocations.
If C<encoding> is set to the literal value C<"auto">, the method L</header>
will be invoked on the opened stream to check if there is a BOM and set the
encoding accordingly. This is equal to passing a true value in the option
L<C<detect_bom>|/detect_bom>.
Encodings can be stacked, as supported by C<binmode>:
# Using PerlIO::via::gzip
csv (in => \@csv,
out => "test.csv:via.gz",
encoding => ":via(gzip):encoding(utf-8)",
);
$aoa = csv (in => "test.csv:via.gz", encoding => ":via(gzip)");
# Using PerlIO::gzip
csv (in => \@csv,
out => "test.csv:gzip.gz",
encoding => ":gzip:encoding(utf-8)",
);
$aoa = csv (in => "test.csv:gzip.gz", encoding => ":gzip");
=head3 detect_bom
If C<detect_bom> is given, the method L</header> will be invoked on the
opened stream to check if there is a BOM and set the encoding accordingly.
C<detect_bom> can be abbreviated to C<bom>.
This is the same as setting L<C<encoding>|/encoding> to C<"auto">.
Note that as the method L</header> is invoked, its default is to also set
the headers.
=head3 headers
If this attribute is not given, the default behavior is to produce an array
of arrays.
If C<headers> is supplied, it should be an anonymous list of column names,
an anonymous hashref, a coderef, or a literal flag: C<auto>, C<lc>, C<uc>,
or C<skip>.
=over 2
=item skip
When C<skip> is used, the header will not be included in the output.
my $aoa = csv (in => $fh, headers => "skip");
C<skip> is invalid/ignored in combination with L<C<detect_bom>|/detect_bom>.
=item auto
If C<auto> is used, the first line of the C<CSV> source will be read as the
list of field headers and used to produce an array of hashes.
my $aoh = csv (in => $fh, headers => "auto");
=item lc
If C<lc> is used, the first line of the C<CSV> source will be read as the
list of field headers mapped to lower case and used to produce an array of
hashes. This is a variation of C<auto>.
my $aoh = csv (in => $fh, headers => "lc");
=item uc
If C<uc> is used, the first line of the C<CSV> source will be read as the
list of field headers mapped to upper case and used to produce an array of
hashes. This is a variation of C<auto>.
my $aoh = csv (in => $fh, headers => "uc");
=item CODE
If a coderef is used, the first line of the C<CSV> source will be read as
the list of mangled field headers in which each field is passed as the only
argument to the coderef. This list is used to produce an array of hashes.
my $aoh = csv (in => $fh,
headers => sub { lc ($_[0]) =~ s/kode/code/gr });
this example is a variation of using C<lc> where all occurrences of C<kode>