Acme-InputRecordSeparatorIsRegexp

lib/Acme/InputRecordSeparatorIsRegexp.pm  view on Meta::CPAN

package Acme::InputRecordSeparatorIsRegexp;

use 5.006;
use strict;
use warnings FATAL => 'all';
use Symbol;
use Carp;
use IO::Handle;
require Exporter;
our @ISA = 'Exporter';
our @EXPORT_OK = ('open','autochomp','input_record_separator','binmode');
our %EXPORT_TAGS = ( all =>  [ @EXPORT_OK ] );

BEGIN {
    no strict 'refs';
    *{ 'Acme::IRSRegexp' . "::" } = \*{ __PACKAGE__ . "::" };
}

our $VERSION = '0.07';

sub TIEHANDLE {
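    my ($pkg, @opts) = @_;
    my $handle;
    if (@opts % 2) {
        # Odd argument count: (REGEXP, key => value, ...); start from a fresh glob
        $handle = Symbol::gensym;
    } else {
        # Even argument count: (GLOB, REGEXP, key => value, ...); dup the glob
        my $fh = *{shift @opts};
        # will fail if open for $fh failed, but that's not important
        eval { CORE::open $handle, '<&+', $fh };
    }
    my $rs = shift @opts;
    my %opts = @opts;
    # Defaults: bufsize is 16384 and maxrecsize is a quarter of bufsize;
    # if only one of the two is given, the other is derived from it.
    $opts{maxrecsize} ||= ($opts{bufsize} || 16384) / 4;
    $opts{bufsize} ||= $opts{maxrecsize} * 4;
    my $self = bless {
        %opts,
        handle => $handle,
        rs => $rs,
        records => [],
        buffer => ''
    }, $pkg;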
    $self->_compile_rs;
    return $self;
}

# We abuse the PerlIO layers syntax to attach
# a regexp specification to a filehandle. This
# function extracts an ':irs(REGEXP)' layer from
# a string.
sub _extract_irs {
    my ($mode) = @_;
    
    my $irs = "";
    my $p0 = index($mode,":irs(");
    my $p1 = $p0 + 5;
    my $nest = 1;
    while ($nest) {
        my $c = eval { substr($mode,$p1++,1) };
        if ($@ || !defined($c)) {
            carp "Argument list not closed for PerlIO layer \"$irs\"";
            return;
        }
        if ($c eq "\\") {
            $c .= substr($mode,$p1++,1);
        }
        if ($c eq "(") { $nest++ }
        if ($c eq ")") { $nest-- }
        if ($nest) { $irs .= $c; }
    }
    substr($mode,$p0,length($irs)+6, "");
    $_[0] = $mode;
    return $irs;
}

sub open (*;$@) {
    no strict 'refs';        # or else bareword file handles will break
    my (undef,$mode,$expr,@list) = @_;
    if (!defined $_[0]) {
        $_[0] = Symbol::gensym;
    }
    my $glob = $_[0];
    if (!ref($glob) && $glob !~ /::/) {
        $glob = join("::",caller(0) || "", $glob);
    }

    if ($mode && index($mode,":irs(") >= 0) {
        my $irs = _extract_irs($mode);
        my $z = @list ? CORE::open *$glob, $mode, $expr, @list
                      : CORE::open *$glob, $mode, $expr;
        tie *$glob, __PACKAGE__, *$glob, $irs;
        return $z;
    }
    if (@list) {
        return CORE::open(*$glob,$mode,$expr,@list);
    } elsif ($expr) {
        return CORE::open(*$glob,$mode,$expr);
    } elsif ($mode) {
        return CORE::open(*$glob,$mode);
    } else {
        return CORE::open(*$glob);
    }
}

sub binmode (*;$) {
    my ($glob,$mode) = @_;
    $mode ||= ":raw";



=head1 DESCRIPTION

In the section about the L<"input record separator"|perlvar/"$/">,
C<perlvar> famously quips

=over 4

Remember: the value of $/ is a string, not a regex. B<awk>
has to be better for something. :-)

=back

This module provides a mechanism to read records from a file
using a regular expression as a record separator.

A common use case for this module is reading a text file when
you don't know whether it uses Unix (C<\n>), Windows/DOS (C<\r\n>),
or classic Mac (C<\r>) line endings, or whether it mixes all three.
To parse such a file properly, tie its file handle to this package
with the appropriate regular expression:

    my $fh = Symbol::gensym;
    tie *$fh, 'Acme::InputRecordSeparatorIsRegexp', '\r\n|\r|\n';
    open $fh, '<', 'file-with-ambiguous-line-endings';

    @lines = <$fh>;
    # or
    while (my $line = <$fh>) { ... }

Like the builtin C<readline> function and C<< <> >> operator, the
C<< <$fh> >> expression returns each line with its record separator
still attached, so the lines returned may end in C<\r\n>, C<\r>,
or C<\n>.
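
The behaviour described above can be illustrated without the module.
The following is a minimal sketch in core Perl, not this package's
implementation: C<split_records> is a hypothetical helper that works
on an in-memory string rather than a file handle, and it exists only
to show what "records that keep their separator" look like.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of this module: split an in-memory
# string into records, keeping each separator attached to its record.
sub split_records {
    my ($text, $sep) = @_;
    my @records;
    # Take the shortest run of characters that ends in a separator,
    # or else whatever is left at the end of the input.
    while ($text =~ /\G(.*?(?:$sep)|.+)/gs) {
        push @records, $1;
    }
    return @records;
}

my @lines = split_records("one\r\ntwo\rthree\nfour", qr/\r\n|\r|\n/);
# @lines now holds "one\r\n", "two\r", "three\n", "four"
```

Note that C<\r\n> is listed first in the alternation; otherwise a
Windows line ending would be consumed as a lone C<\r> followed by an
empty record ending in C<\n>.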

Another use case is files that contain multiple types of records
where a different sequence of characters is used to denote the
end of different types of records.
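
As an illustration of that second use case, here is a small sketch in
core Perl. The data format is invented for the example (control
records ending in C<;;>, data rows ending in a newline), and the loop
works on a string instead of a tied file handle, so treat it as a
picture of the idea rather than of this module's internals.

```perl
use strict;
use warnings;

# Hypothetical format: header/trailer records end in ";;",
# data rows end in "\n".
my $data = "HEADER v1;;row 1\nrow 2\nTRAILER;;";
my $sep  = qr/;;|\n/;

my @records;
while ($data =~ /\G(.*?)($sep)/gs) {
    my ($body, $terminator) = ($1, $2);
    # The terminator that matched tells us which kind of record this is.
    my $type = $terminator eq ';;' ? 'control' : 'data';
    push @records, [ $type, $body ];
}
# @records: [control, "HEADER v1"], [data, "row 1"],
#           [data, "row 2"], [control, "TRAILER"]
```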

=head1 tie STATEMENT

A typical use of this package might look like

    my $fh = Symbol::gensym;
    tie *$fh, 'Acme::InputRecordSeparatorIsRegexp', $record_sep_regexp;
    open $fh, '<', $filename;

where C<$record_sep_regexp> is a string or a C<Regexp> object
(created with the
L<< C<qr/.../>|perlop/"Quote and Quote-like Operators" >> notation)
containing the regular expression you want to use for the file's
line endings. See also the convenience function L<"open"> for an
alternate way to obtain a file handle with the features of this
package.
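
The two forms of separator can be compared in plain Perl. This sketch
only relies on general Perl behaviour (a pattern string and a
C<qr//> object both interpolate into a regular expression); how this
package compiles its separator internally is not shown here, and the
variable names are made up for the example.

```perl
use strict;
use warnings;

my $as_string = '\r\n|\r|\n';      # single quotes keep the backslashes
my $as_object = qr/\r\n|\r|\n/;    # precompiled Regexp object

my $sample = "a\r\nb\rc\n";
my @counts;
for my $sep ($as_string, $as_object) {
    my $re = qr/(?:$sep)/;             # either form interpolates into a pattern
    my $n  = () = $sample =~ /$re/g;   # count the separator matches
    push @counts, $n;
}
# both entries of @counts are 3
```

When passing the separator as a string, prefer single quotes so that
escapes such as C<\R> reach the regular expression engine unchanged.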

=head1 FUNCTIONS

=head2 open

Another way to attach a regular expression to the input record
separator of a file handle, available since v0.04, is to import
this package's C<open> function and specify an C<:irs(...)>
I<pseudo-layer>.

   use Acme::InputRecordSeparatorIsRegexp 'open';
   $result = open FILEHANDLE, "<:irs(REGEXP)", EXPR
   $result = open FILEHANDLE, "<:irs(REGEXP)", EXPR, LIST
   $result = open FILEHANDLE, "<:irs(REGEXP)", REFERENCE

   $result = open my $fh, '<:irs(\r\n|\r|\n)', "ambiguous-line-endings.txt";

The C<:irs(...)> layer may be combined with other layers.

   open my $fh, '<:encoding(UTF-16):irs(\R)', "ambiguous.txt";
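
Combining layers works because the C<:irs(...)> part is removed from
the mode string before the rest is handed to Perl's builtin C<open>.
The sketch below shows one way to perform that separation. It is not
the package's own code: C<split_mode> is a hypothetical helper, and it
uses a recursive pattern (C<(?1)>, which needs Perl 5.10 or later)
where the package itself walks the string character by character.

```perl
use strict;
use warnings;

# Hypothetical helper: pull an ":irs(...)" pseudo-layer out of a mode
# string and return the remaining mode plus the regexp source.
sub split_mode {
    my ($mode) = @_;
    # (?1) recurses into group 1 so nested, balanced parentheses are
    # allowed; \\. lets a backslash escape any character, including
    # a parenthesis.
    if ($mode =~ s/:irs(\((?:\\.|[^()\\]|(?1))*\))//s) {
        my $irs = substr($1, 1, -1);    # drop the enclosing parentheses
        return ($mode, $irs);
    }
    return ($mode, undef);
}

my ($layers, $irs) = split_mode('<:encoding(UTF-16):irs(\r\n|(?:\r|\n))');
# $layers is '<:encoding(UTF-16)' and $irs is '\r\n|(?:\r|\n)'
```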

See also: L<"binmode">

=head2 autochomp

Returns the current setting, or sets the C<autochomp> attribute
of a file handle associated with this package. When the
C<autochomp> attribute of the file handle is enabled, any lines
read from the file handle through the C<readline> function
or C<< <> >> operator will be returned with the (custom) line
endings automatically removed.

    use Acme::InputRecordSeparatorIsRegexp 'open','autochomp';
    open my $fh, '<:irs(\R)', 'ambiguous.txt';
    autochomp($fh,1);           # enable autochomp
    my $is_autochomped = autochomp($fh);
    autochomp(tied(*$fh), 0);   # disable

This function can also be called as a method on the object that
the file handle is I<tied> to.

    (tied *$fh)->autochomp(1);  # enable
    $fh->autochomp(0);          # not OK, must use the tied object

Enabling C<autochomp> with this function on a regular file handle
will tie the file handle into this package using the current
value of C<$/> as the handle's record separator. If you are
just looking for autochomp functionality and don't care about
applying regular expressions to determine line endings, this
function provides an (inefficient) way to get it for
arbitrary file handles.

The default attribute value is false.
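
The effect of C<autochomp> on returned lines can be pictured with a
short core-Perl sketch. It assumes the separator C<\r\n|\r|\n> and
operates on an array of already-read strings; it is an illustration
of the result, not of how this package implements the attribute.

```perl
use strict;
use warnings;

my $sep = qr/\r\n|\r|\n/;
my @raw = ("alpha\r\n", "beta\r", "gamma\n", "delta");

# Remove one trailing separator, if present, from a copy of each line.
my @chomped = map {
    (my $copy = $_) =~ s/(?:$sep)\z//;
    $copy;
} @raw;
# @chomped now holds "alpha", "beta", "gamma", "delta"
```

Unlike the builtin C<chomp>, which removes only the current string
value of C<$/>, the substitution removes whichever alternative of the
regular expression ended the line.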

=head2 binmode FILEHANDLE, LAYER

Overrides Perl's builtin L<binmode|perlfunc/"binmode"> function. 
If the I<pseudo-layer> C<:irs(...)> is specified, then apply the 
given regular expression as the dynamic input record separator for 
the given filehandle.
Any other layers specified are passed to Perl's builtin C<binmode>
function.


=head2 input_record_separator


