DataExtract-FixedWidth

 view release on metacpan or  search on metacpan

lib/DataExtract/FixedWidth.pm  view on Meta::CPAN

package DataExtract::FixedWidth;
use Moose;
use Carp;

our $VERSION = '0.09';

sub BUILD {
	my $self = shift;

	confess 'You must either send either a "header_row" or data for "heuristic"'
		unless $self->has_header_row || $self->has_heuristic
	;
	confess 'You must send a "header_row" if you send "cols"'
		if $self->has_cols && !$self->has_header_row && !$self->has_heuristic
	;

}

has 'unpack_string' => (
	isa          => 'Str'
	, is         => 'rw'
	, lazy_build => 1
);

has 'cols' => (
	isa            => 'ArrayRef'
	, is           => 'rw'
	, auto_deref   => 1
	, lazy_build   => 1
);

has 'colchar_map' => (
	isa          => 'HashRef'
	, is         => 'rw'
	, lazy_build => 1
);

has 'header_row' => (
	isa          => 'Maybe[Str]'
	, is         => 'rw'
	, predicate  => 'has_header_row'
);

has 'first_col_zero' => (
	isa       => 'Bool'
	, is      => 'ro'
	, default => 1
);

has 'fix_overlay' => (
	isa       => 'Bool'
	, is      => 'ro'
	, default => 0
);

has 'trim_whitespace' => (
	isa       => 'Bool'
	, is      => 'ro'
	, default => 1
);

has 'sorted_colstart' => (
	isa          => 'ArrayRef'
	, is         => 'ro'
	, lazy_build => 1
	, auto_deref => 1
);

has 'null_as_undef' => (
	isa       => 'Bool'
	, is      => 'ro'
	, default => 0
);

has 'heuristic' => (
	isa          => 'ArrayRef'
	, is         => 'rw'
	, predicate  => 'has_heuristic'
	, auto_deref => 1
	, trigger    => \&_heuristic_trigger
);

has 'skip_header_data' => (
	isa       => 'Bool'
	, is      => 'rw'
	, default => 1
);

has 'verbose' => ( isa => 'Bool', 'is' => 'ro', default => 0 );

sub _heuristic_trigger {
	my ( $self, $data ) = @_;

	chomp @$data;

	my $maxLength = 0;
	for ( @$data ) {
		$maxLength = length if length > $maxLength
	}

	$self->header_row( $data->[0] )
		unless $self->has_header_row
	;

	{
		my @unpack;
		my $mask = ' ' x $maxLength;
		$mask |= $_ for @$data;

		## The (?=\S) fixes a bug that creates null columns in the event any

lib/DataExtract/FixedWidth.pm  view on Meta::CPAN


	}

	$ccm;

}

sub _build_unpack_string {
	my $self = shift;

	my @unpack;
	my @startcols = $self->sorted_colstart;
	$startcols[0] = 0 if $self->first_col_zero;
	foreach my $idx ( 0 .. $#startcols ) {

		if ( exists $startcols[$idx+1] ) {
			push @unpack, ( $startcols[$idx+1] - $startcols[$idx] );
		}

	}

	$self->_helper_unpack( \@unpack );

}

## Takes ArrayRef of startcols and returns the unpack string.
sub _helper_unpack {
	my ( $self, $startcols ) = @_;

	my $format;
	if ( @$startcols ) {
		$format = 'a' . join 'a', @$startcols;
	}
	$format .= 'A*';

	$format;

}

sub parse {
	my ( $self, $data ) = @_;

	return undef if !defined $data;

	chomp $data;

	## skip_header_data
	if (
		$self->skip_header_data
		&& ( defined $self->header_row && $data eq $self->header_row )
	) {
		warn "Skipping duplicate header row\n" if $self->verbose;
		return undef
	}

	#printf "\nData:|%s|\tHeader:|%s|", $data, $self->header_row;

	my @cols = unpack ( $self->unpack_string, $data );

	## If we bleed over a bit we can fix that.
	if ( $self->fix_overlay ) {
		foreach my $idx ( 0 .. $#cols ) {
			if (
				$cols[$idx] =~ m/\S+$/
				&& exists $cols[$idx+1]
				&& $cols[$idx+1] =~ s/^(\S+)//
			) {
					$cols[$idx] .= $1;
			}
		}
	}

	## Get rid of whitespaces
	if ( $self->trim_whitespace ) {
		for ( @cols ) { s/^\s+//; s/\s+$//; }
	}

	## Swithc nulls to undef
	if ( $self->null_as_undef ) {
		croak 'This ->null_as_undef option mandates ->trim_whitespace be true'
			unless $self->trim_whitespace
		;
		for ( @cols ) { undef $_ unless length($_) }
	}

	\@cols;

}

sub parse_hash {
	my ( $self, $data ) = @_;

	my $row = $self->parse( $data );

	my $colstarts = $self->sorted_colstart;

	my $results;
	foreach my $idx ( 0 .. $#$row ) {
		my $col = $self->colchar_map->{ $colstarts->[$idx] };
		$results->{ $col } = $row->[$idx];
	}

	$results;

}

sub _build_sorted_colstart {
	my $self = shift;

	my @startcols = map { $_->[0] }
		sort { $a->[1] <=> $b->[1] }
		map { [$_, sprintf( "%10d", $_ ) ] }
		keys %{ $self->colchar_map }
	;

	\@startcols;

}

no Moose;
__PACKAGE__->meta->make_immutable;

lib/DataExtract/FixedWidth.pm  view on Meta::CPAN


	SAMPLE FILE
	HEADER:  'COL1NAME       COL2NAME       COL3NAMEEEEE'
	DATA1:   'FOOBARBAZ      THIS IS TEXT   ANHER COL   '
	DATA2:   'FOOBAR FOOBAR  IS TEXT        ANOTHER COL '

After you have constructed, you can C<-E<gt>parse> which will return an ArrayRef
	$de->parse('FOOBARBAZ THIS IS TEXT    ANOTHER COL');

Or, you can use C<-E<gt>parse_hash()> which returns a HashRef of the data indexed by the column headers. They can be determined in many ways with the data you provide.

=head2 Constructor

The class constructor, C<-E<gt>new>, has numerious forms. Some options it has are:

=over 12

=item heuristics => \@lines

This will deduce the unpack format string from data. If you opt to use this method, and need parse_hash, the first row of the heurisitic is assumed to be the header_row. The unpack_string that results for the heuristic is applied to the header_row to...

=item cols => \@cols

This will permit you to explicitly list the columns in the header row. This is especially handy if you have spaces in the column header. This option will make the C<header_row> mandatory.

=item header_row => $string

If a C<cols> option is not provided the assumption is that there are no spaces in the column header. The module can take care of the rest. The only way this column can be avoided is if we deduce the header from heuristics, or if you explicitly supply...

=item verbose => 1|0

Right now, it simply display's warnings when it does something that might at first seem awkward. Like returning undef when it encouters a duplicate copy of a header row.

=back

=head2 Methods

B<An astrisk, (*) in the option means that is the default.>

=over 12

=item ->parse( $data_line )

Parses the data and returns an ArrayRef

=item ->parse_hash( $data_line )

Parses the data and returns a HashRef, indexed by the I<cols> (headers)

=item ->first_col_zero(1*|0)

This option forces the unpack string to make the first column assume the characters to the left of the header column. So, in the below example the first column also includes the first char of the row, even though the word stock begins at the second c...

	CHAR NUMBERS: |1|2|3|4|5|6|7|8|9|10
	HEADER ROW  : | |S|T|O|C|K| |V|I|N

=item ->trim_whitespace(*1|0)

Trim the whitespace for the elements that C<-E<gt>parse($line)> outputs.

=item ->fix_overlay(1|0*)

Fixes columns that bleed into other columns, move over all non-whitespace characters preceding the first whitespace of the next column. This does not work with heurisitic because the unpack string makes the assumption the data is not mangeled.

So if ColumnA as is 'foob' and ColumnB is 'ar Hello world'

* ColumnA becomes 'foobar', and ColumnB becomes 'Hello world'

=item ->null_as_undef(1|0*)

Simply undef all elements that return C<length(element) = 0>, requires C<-E<gt>trim_whitespace>.

=item ->skip_header_data(1*|0)

Skips duplicate copies of the header_row if found in the data.

=item ->colchar_map

Returns a HashRef that displays the results of each column header and relative character position the column starts at. In the case of heuristic this is a simple ordinal number. In the case of non-heuristic provided data it is currently a cardinal ch...

=item ->unpack_string

Returns the C<CORE::unpack()> template string that will be used internally by C<-E<gt>parse($line)>

=back

=head1 AVAILABILITY

CPAN.org

Git repo at L<http://repo.or.cz/w/DataExtract-FixedWidth.git>

=head1 COPYRIGHT & LICENSE

Copyright 2008 Evan, all rights reserved.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.


=head1 AUTHOR

	Evan Carroll <me at evancarroll.com>
	System Lord of the Internets

=head1 BUGS

Please report any bugs or feature requests to C<bug-dataexract-fixedwidth at rt.cpan.org>, or through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=DataExtract-FixedWidth>.  I will be notified, and then you'll
automatically be notified of progress on your bug as I make changes.

=cut



( run in 1.388 second using v1.01-cache-2.11-cpan-13bb782fe5a )