stopped before we got to it - at a given character position, leftmost
wins.  Second, we were able to get a match at the first character
position of the string C<'a'>.  If there were no matches at the first
position, Perl would move to the second character position C<'b'> and
attempt the match all over again.  Only when all possible paths at all
possible character positions have been exhausted does Perl give
up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;>> to be false.

Even with all this work, regexp matching happens remarkably fast.  To
speed things up, Perl compiles the regexp into a compact sequence of
opcodes that can often fit inside a processor cache.  When the code is
executed, these opcodes can then run at full throttle and search very
quickly.

=head2 Extracting matches

The grouping metacharacters C<()> also serve another completely
different function: they allow the extraction of the parts of a string
that matched.  This is very useful to find out what matched and for
text processing in general.  For each grouping, the part that matched
inside goes into the special variables C<$1>, C<$2>, I<etc>.  They can be
used just as ordinary variables:

    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/>> returns a true or false
value.  In list context, however, it returns the list of matched values
C<($1,$2,$3)>.  So we could write the code more compactly as

    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
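
As a minimal, self-contained sketch of the list-context form (the sample
strings are made up for illustration), note that a failed match returns
an empty list, so all three variables end up undefined:

    use strict;
    use warnings;

    for my $time ('12:34:56', 'noon') {
        # list context: the match returns ($1, $2, $3), or () on failure
        my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;
        if (defined $hours) {
            print "$time => h=$hours m=$minutes s=$seconds\n";
        }
        else {
            print "$time => no match\n";
        }
    }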

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
I<etc>.  Here is a regexp with nested groups:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

If this regexp matches, C<$1> contains a string starting with
C<'ab'>, C<$2> is either set to C<'cd'> or C<'ef'>, C<$3> equals either
C<'gi'> or C<'j'>, and C<$4> is either set to C<'gi'>, just like C<$3>,
or it remains undefined.
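
To see these rules in action, the following sketch runs the nested
pattern against two sample strings (chosen here purely for
illustration) and reports which groups were set:

    use strict;
    use warnings;

    for my $str ('abefgi', 'abcdj') {
        if (my @groups = $str =~ /(ab(cd|ef)((gi)|j))/) {
            # @groups holds ($1, $2, $3, $4); unset groups are undef
            my @shown = map { defined $_ ? "'$_'" : 'undef' } @groups;
            print "$str: @shown\n";
        }
    }
    # abefgi: 'abefgi' 'ef' 'gi' 'gi'
    # abcdj: 'abcdj' 'cd' 'j' undef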

For convenience, Perl sets C<$+> to the string held by the highest numbered
C<$1>, C<$2>,... that got assigned (and, somewhat related, C<$^N> to the
value of the C<$1>, C<$2>,... most-recently assigned; I<i.e.> the C<$1>,
C<$2>,... associated with the rightmost closing parenthesis used in the
match).
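
The difference between C<$+> and C<$^N> only shows up with nested
groups.  A small sketch, using a made-up two-word string and
illustrative variable names:

    use strict;
    use warnings;

    if ('hello world' =~ /((\w+) (\w+))/) {
        # $1 = 'hello world', $2 = 'hello', $3 = 'world'
        my ($highest, $last_closed) = ($+, $^N);
        print "highest numbered group: $highest\n";      # world
        print "most recently closed:   $last_closed\n";  # hello world
    }

Here the outer group is number 1, but its closing parenthesis is the
rightmost one, so C<$^N> reports C<$1> while C<$+> reports C<$3>.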


=head2 Backreferences

Closely associated with the matching variables C<$1>, C<$2>, ... are
the I<backreferences> C<\g1>, C<\g2>,...  Backreferences are simply
matching variables that can be used I<inside> a regexp.  This is a
really nice feature; what matches later in a regexp is made to depend on
what matched earlier in the regexp.  Suppose we wanted to look
for doubled words in a text, like "the the".  The following regexp finds
all 3-letter doubles separated by a single whitespace character:

    /\b(\w\w\w)\s\g1\b/;

The grouping assigns a value to C<\g1>, so that the same 3-letter sequence
is used for both parts.
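
A short sketch of this regexp at work; both sample sentences are
invented for the purpose:

    use strict;
    use warnings;

    for my $text ('Paris in the the spring', 'one two three') {
        if ($text =~ /\b(\w\w\w)\s\g1\b/) {
            print "doubled word '$1' in: $text\n";
        }
        else {
            print "no doubled word in: $text\n";
        }
    }
    # doubled word 'the' in: Paris in the the spring
    # no doubled word in: one two three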

A similar task is to find words consisting of two identical parts:

    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa

The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, I<etc>., and uses C<\g1> to look for
a repeat.  Although C<$1> and C<\g1> represent the same thing, care should be
taken to use matching variables C<$1>, C<$2>,... only I<outside> a regexp
and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing
so may lead to surprising and unsatisfactory results.
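
Without a dictionary file at hand, the same idea can be tried on an
in-memory word list.  In this sketch the word list is an arbitrary
sample, and Perl's built-in C<grep> stands in for the C<simple_grep>
script:

    use strict;
    use warnings;

    my @words   = qw(banana beriberi booboo coco mama perl);
    my @doubled = grep { /^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$/ } @words;
    print "@doubled\n";    # beriberi booboo coco mama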


=head2 Relative backreferences

Counting the opening parentheses to get the correct number for a
backreference is error-prone as soon as there is more than one
capturing group.  A more convenient technique became available
with Perl 5.10: relative backreferences.  To refer to the immediately
preceding capture group, write C<\g-1> or C<\g{-1}>; the one before
that is available via C<\g-2> or C<\g{-2}>, and so on.

Another good reason in addition to readability and maintainability
for using relative backreferences is illustrated by the following example,
where a simple pattern for matching peculiar strings is used:

    $a99a = '([a-z])(\d)\g2\g1';   # matches a11a, g22g, x33x, etc.

Now that we have this pattern stored as a handy string, we might feel
tempted to use it as a part of some other pattern:

    $line = "code=e99e";
    if ($line =~ /^(\w+)=$a99a$/){   # unexpected behavior!
        print "$1 is valid\n";
    } else {
        print "bad line: '$line'\n";
    }

But this doesn't match, at least not the way one might expect. Only
after inserting the interpolated C<$a99a> and looking at the resulting
full text of the regexp is it obvious that the backreferences have
backfired. The subexpression C<(\w+)> has snatched number 1 and
demoted the groups in C<$a99a> by one rank. This can be avoided by
using relative backreferences:

    $a99a = '([a-z])(\d)\g{-1}\g{-2}';  # safe for being interpolated
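
The two variants can be compared side by side.  This sketch uses the
same sample line; the variable names C<$absolute> and C<$relative> are
only for illustration:

    use strict;
    use warnings;

    my $line     = 'code=e99e';
    my $absolute = '([a-z])(\d)\g2\g1';        # numbers shift when embedded
    my $relative = '([a-z])(\d)\g{-1}\g{-2}';  # counts back from where it is

    for my $pair ([absolute => $absolute], [relative => $relative]) {
        my ($label, $pat) = @$pair;
        if ($line =~ /^(\w+)=$pat$/) {
            print "$label: matches, key is '$1'\n";
        }
        else {
            print "$label: does not match\n";
        }
    }
    # absolute: does not match
    # relative: matches, key is 'code'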


=head2 Named backreferences

Perl 5.10 also introduced named capture groups and named backreferences.
To attach a name to a capturing group, you write either
C<< (?<name>...) >> or C<< (?'name'...) >>.  The backreference may
then be written as C<\g{name}>.  It is permissible to attach the
same name to more than one group, but then only the leftmost group of
that name which actually took part in the match can be referenced.
Outside of the pattern a named capture group is accessible through the
C<%+> hash.
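
For instance, a search for doubled words can be written with a named
group.  A minimal sketch; the group name C<word> and the sample text
are arbitrary:

    use strict;
    use warnings;

    my $text = 'this is is a test';
    if ($text =~ /\b(?<word>\w+)\s+\g{word}\b/) {
        print "doubled word: $+{word}\n";    # doubled word: is
    }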

Assuming that we have to match calendar dates which may be given in one
of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
three suitable patterns where we use C<'d'>, C<'m'> and C<'y'> as the
names of the groups capturing the day, month and year of a date. The
matching operation combines the three patterns as alternatives:

    $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
    for my $d (qw(2006-10-21 15.01.2007 10/31/2005)) {
        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
            print "day=$+{d} month=$+{m} year=$+{y}\n";
        }
    }

If any of the alternatives matches, the hash C<%+> is bound to contain the
three key-value pairs.


=head2 Alternative capture group numbering

Yet another capturing group numbering technique (also available since Perl 5.10)
deals with the problem of referring to groups within a set of alternatives.
Consider a pattern for matching a time of the day, civil or military style:

    if ( $time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/ ){
        # process hour and minute
    }

Processing the results requires an additional if statement to determine
whether C<$1> and C<$2> or C<$3> and C<$4> contain the goodies. It would
be easier if we could use group numbers 1 and 2 in the second alternative
as well, and this is exactly what the parenthesized construct C<(?|...)>,
set around an alternation, achieves. Here is an extended version of the
previous pattern:

  if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){
      print "hour=$1 minute=$2 zone=$3\n";
  }

Within the alternative numbering group, group numbers start at the same
position for each alternative. After the group, numbering continues
with one higher than the maximum reached across all the alternatives.
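
A sketch that exercises both alternatives; the sample times and zone
abbreviations are invented, and storing the pattern in C<$re> with
C<qr//> is only a way to keep the lines short:

    use strict;
    use warnings;

    my $re = qr/(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/;

    for my $time ('9:30 EST', '1430 UTC') {
        if ($time =~ $re) {
            # $1 and $2 are hour and minute in either style; zone is $3
            print "hour=$1 minute=$2 zone=$3\n";
        }
    }
    # hour=9 minute=30 zone=EST
    # hour=14 minute=30 zone=UTC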

=head2 Position information

In addition to what was matched, Perl also provides the
positions of what was matched as contents of the C<@-> and C<@+>
arrays. C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
position of the start of the C<$n> match and C<$+[n]> is the position
of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
this code

    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
