DataExtract-FixedWidth
view release on metacpan or search on metacpan
lib/DataExtract/FixedWidth.pm view on Meta::CPAN
package DataExtract::FixedWidth;
use Moose;
use Carp;
our $VERSION = '0.09';
sub BUILD {
my $self = shift;
confess 'You must either send either a "header_row" or data for "heuristic"'
unless $self->has_header_row || $self->has_heuristic
;
confess 'You must send a "header_row" if you send "cols"'
if $self->has_cols && !$self->has_header_row && !$self->has_heuristic
;
}
has 'unpack_string' => (
isa => 'Str'
, is => 'rw'
, lazy_build => 1
);
has 'cols' => (
isa => 'ArrayRef'
, is => 'rw'
, auto_deref => 1
, lazy_build => 1
);
has 'colchar_map' => (
isa => 'HashRef'
, is => 'rw'
, lazy_build => 1
);
has 'header_row' => (
isa => 'Maybe[Str]'
, is => 'rw'
, predicate => 'has_header_row'
);
has 'first_col_zero' => (
isa => 'Bool'
, is => 'ro'
, default => 1
);
has 'fix_overlay' => (
isa => 'Bool'
, is => 'ro'
, default => 0
);
has 'trim_whitespace' => (
isa => 'Bool'
, is => 'ro'
, default => 1
);
has 'sorted_colstart' => (
isa => 'ArrayRef'
, is => 'ro'
, lazy_build => 1
, auto_deref => 1
);
has 'null_as_undef' => (
isa => 'Bool'
, is => 'ro'
, default => 0
);
has 'heuristic' => (
isa => 'ArrayRef'
, is => 'rw'
, predicate => 'has_heuristic'
, auto_deref => 1
, trigger => \&_heuristic_trigger
);
has 'skip_header_data' => (
isa => 'Bool'
, is => 'rw'
, default => 1
);
has 'verbose' => ( isa => 'Bool', 'is' => 'ro', default => 0 );
sub _heuristic_trigger {
my ( $self, $data ) = @_;
chomp @$data;
my $maxLength = 0;
for ( @$data ) {
$maxLength = length if length > $maxLength
}
$self->header_row( $data->[0] )
unless $self->has_header_row
;
{
my @unpack;
my $mask = ' ' x $maxLength;
$mask |= $_ for @$data;
## The (?=\S) fixes a bug that creates null columns in the event any
lib/DataExtract/FixedWidth.pm view on Meta::CPAN
}
$ccm;
}
sub _build_unpack_string {
my $self = shift;
my @unpack;
my @startcols = $self->sorted_colstart;
$startcols[0] = 0 if $self->first_col_zero;
foreach my $idx ( 0 .. $#startcols ) {
if ( exists $startcols[$idx+1] ) {
push @unpack, ( $startcols[$idx+1] - $startcols[$idx] );
}
}
$self->_helper_unpack( \@unpack );
}
## Takes ArrayRef of startcols and returns the unpack string.
sub _helper_unpack {
my ( $self, $startcols ) = @_;
my $format;
if ( @$startcols ) {
$format = 'a' . join 'a', @$startcols;
}
$format .= 'A*';
$format;
}
sub parse {
my ( $self, $data ) = @_;
return undef if !defined $data;
chomp $data;
## skip_header_data
if (
$self->skip_header_data
&& ( defined $self->header_row && $data eq $self->header_row )
) {
warn "Skipping duplicate header row\n" if $self->verbose;
return undef
}
#printf "\nData:|%s|\tHeader:|%s|", $data, $self->header_row;
my @cols = unpack ( $self->unpack_string, $data );
## If we bleed over a bit we can fix that.
if ( $self->fix_overlay ) {
foreach my $idx ( 0 .. $#cols ) {
if (
$cols[$idx] =~ m/\S+$/
&& exists $cols[$idx+1]
&& $cols[$idx+1] =~ s/^(\S+)//
) {
$cols[$idx] .= $1;
}
}
}
## Get rid of whitespaces
if ( $self->trim_whitespace ) {
for ( @cols ) { s/^\s+//; s/\s+$//; }
}
## Swithc nulls to undef
if ( $self->null_as_undef ) {
croak 'This ->null_as_undef option mandates ->trim_whitespace be true'
unless $self->trim_whitespace
;
for ( @cols ) { undef $_ unless length($_) }
}
\@cols;
}
sub parse_hash {
my ( $self, $data ) = @_;
my $row = $self->parse( $data );
my $colstarts = $self->sorted_colstart;
my $results;
foreach my $idx ( 0 .. $#$row ) {
my $col = $self->colchar_map->{ $colstarts->[$idx] };
$results->{ $col } = $row->[$idx];
}
$results;
}
sub _build_sorted_colstart {
my $self = shift;
my @startcols = map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [$_, sprintf( "%10d", $_ ) ] }
keys %{ $self->colchar_map }
;
\@startcols;
}
no Moose;
__PACKAGE__->meta->make_immutable;
lib/DataExtract/FixedWidth.pm view on Meta::CPAN
SAMPLE FILE
HEADER: 'COL1NAME COL2NAME COL3NAMEEEEE'
DATA1: 'FOOBARBAZ THIS IS TEXT ANHER COL '
DATA2: 'FOOBAR FOOBAR IS TEXT ANOTHER COL '
After you have constructed, you can C<-E<gt>parse> which will return an ArrayRef
$de->parse('FOOBARBAZ THIS IS TEXT ANOTHER COL');
Or, you can use C<-E<gt>parse_hash()> which returns a HashRef of the data indexed by the column headers. They can be determined in many ways with the data you provide.
=head2 Constructor
The class constructor, C<-E<gt>new>, has numerious forms. Some options it has are:
=over 12
=item heuristics => \@lines
This will deduce the unpack format string from data. If you opt to use this method, and need parse_hash, the first row of the heurisitic is assumed to be the header_row. The unpack_string that results for the heuristic is applied to the header_row to...
=item cols => \@cols
This will permit you to explicitly list the columns in the header row. This is especially handy if you have spaces in the column header. This option will make the C<header_row> mandatory.
=item header_row => $string
If a C<cols> option is not provided the assumption is that there are no spaces in the column header. The module can take care of the rest. The only way this column can be avoided is if we deduce the header from heuristics, or if you explicitly supply...
=item verbose => 1|0
Right now, it simply display's warnings when it does something that might at first seem awkward. Like returning undef when it encouters a duplicate copy of a header row.
=back
=head2 Methods
B<An astrisk, (*) in the option means that is the default.>
=over 12
=item ->parse( $data_line )
Parses the data and returns an ArrayRef
=item ->parse_hash( $data_line )
Parses the data and returns a HashRef, indexed by the I<cols> (headers)
=item ->first_col_zero(1*|0)
This option forces the unpack string to make the first column assume the characters to the left of the header column. So, in the below example the first column also includes the first char of the row, even though the word stock begins at the second c...
CHAR NUMBERS: |1|2|3|4|5|6|7|8|9|10
HEADER ROW : | |S|T|O|C|K| |V|I|N
=item ->trim_whitespace(*1|0)
Trim the whitespace for the elements that C<-E<gt>parse($line)> outputs.
=item ->fix_overlay(1|0*)
Fixes columns that bleed into other columns, move over all non-whitespace characters preceding the first whitespace of the next column. This does not work with heurisitic because the unpack string makes the assumption the data is not mangeled.
So if ColumnA as is 'foob' and ColumnB is 'ar Hello world'
* ColumnA becomes 'foobar', and ColumnB becomes 'Hello world'
=item ->null_as_undef(1|0*)
Simply undef all elements that return C<length(element) = 0>, requires C<-E<gt>trim_whitespace>.
=item ->skip_header_data(1*|0)
Skips duplicate copies of the header_row if found in the data.
=item ->colchar_map
Returns a HashRef that displays the results of each column header and relative character position the column starts at. In the case of heuristic this is a simple ordinal number. In the case of non-heuristic provided data it is currently a cardinal ch...
=item ->unpack_string
Returns the C<CORE::unpack()> template string that will be used internally by C<-E<gt>parse($line)>
=back
=head1 AVAILABILITY
CPAN.org
Git repo at L<http://repo.or.cz/w/DataExtract-FixedWidth.git>
=head1 COPYRIGHT & LICENSE
Copyright 2008 Evan, all rights reserved.
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
=head1 AUTHOR
Evan Carroll <me at evancarroll.com>
System Lord of the Internets
=head1 BUGS
Please report any bugs or feature requests to C<bug-dataexract-fixedwidth at rt.cpan.org>, or through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=DataExtract-FixedWidth>. I will be notified, and then you'll
automatically be notified of progress on your bug as I make changes.
=cut
( run in 1.388 second using v1.01-cache-2.11-cpan-13bb782fe5a )