pod/perlretut.pod  view on Meta::CPAN

stopped before we got to it - at a given character position, leftmost
wins.  Second, we were able to get a match at the first character
position of the string C<'a'>.  If there were no matches at the first
position, Perl would move to the second character position C<'b'> and
attempt the match all over again.  Only when all possible paths at all
possible character positions have been exhausted does Perl give
up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;>> to be false.

Even with all this work, regexp matching happens remarkably fast.  To
speed things up, Perl compiles the regexp into a compact sequence of
opcodes that can often fit inside a processor cache.  When the code is
executed, these opcodes can then run at full throttle and search very
quickly.

=head2 Extracting matches

The grouping metacharacters C<()> also serve another completely
different function: they allow the extraction of the parts of a string
that matched.  This is very useful to find out what matched and for
text processing in general.  For each grouping, the part that matched
inside goes into the special variables C<$1>, C<$2>, I<etc>.  They can be
used just as ordinary variables:

    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/>> returns a true or false
value.  In list context, however, it returns the list of matched values
C<($1,$2,$3)>.  So we could write the code more compactly as

    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
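
The list-context form can also double as a test: a failed match returns
an empty list, so the assignment itself can serve as the condition.  A
minimal sketch, in which the sample value of C<$time> is an assumption
made purely for illustration:

    # $time holds a made-up sample value
    my $time = "12:34:56";
    if ( my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/ ) {
        print "h=$hours m=$minutes s=$seconds\n";   # prints h=12 m=34 s=56
    }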

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
I<etc>.  Here is a regexp with nested groups:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

If this regexp matches, C<$1> contains a string starting with
C<'ab'>, C<$2> is either set to C<'cd'> or C<'ef'>, C<$3> equals either
C<'gi'> or C<'j'>, and C<$4> is either set to C<'gi'>, just like C<$3>,
or it remains undefined.
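
As a quick check of this numbering, the following sketch applies the
regexp to a sample string (the string C<'abcdj'> is an assumption chosen
so that the C<'j'> alternative is taken and group 4 stays unset):

    my $str = "abcdj";                       # hypothetical sample input
    if ( $str =~ /(ab(cd|ef)((gi)|j))/ ) {
        print "1=$1 2=$2 3=$3\n";            # prints 1=abcdj 2=cd 3=j
        # group 4 took no part in the match, so $4 is undefined
        print "4 is ", ( defined($4) ? "'$4'" : "undefined" ), "\n";
    }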

For convenience, Perl sets C<$+> to the string held by the highest numbered
C<$1>, C<$2>,... that got assigned (and, somewhat related, C<$^N> to the
value of the C<$1>, C<$2>,... most-recently assigned; I<i.e.> the C<$1>,
C<$2>,... associated with the rightmost closing parenthesis used in the
match).
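
The difference between the two variables shows up with nested groups.
In this small sketch (the pattern and string are illustrative
assumptions), group 3 is the highest numbered group, while group 1 is the
one whose closing parenthesis is reached last:

    if ( "ab" =~ /((a)(b))/ ) {
        # copy the special variables before anything else can change them
        my ($highest, $last_closed) = ($+, $^N);
        print "highest numbered group: $highest\n";      # b
        print "most recently closed:   $last_closed\n";  # ab
    }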


=head2 Backreferences

Closely associated with the matching variables C<$1>, C<$2>, ... are
the I<backreferences> C<\g1>, C<\g2>,...  Backreferences are simply
matching variables that can be used I<inside> a regexp.  This is a
really nice feature; what matches later in a regexp is made to depend on
what matched earlier in the regexp.  Suppose we wanted to look
for doubled words in a text, like "the the".  The following regexp finds
all 3-letter doubles with a space in between:

    /\b(\w\w\w)\s\g1\b/;

The grouping assigns a value to C<\g1>, so that the same 3-letter sequence
is used for both parts.
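
Here is a short sketch of that regexp at work; the sample sentence is an
assumption made for illustration:

    my $text = "Paris in the the spring";    # hypothetical sample text
    if ( $text =~ /\b(\w\w\w)\s\g1\b/ ) {
        print "doubled word: $1\n";          # prints doubled word: the
    }

Inside the regexp the repeat is written C<\g1>; once the match has
finished, the same text is read back as C<$1>.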

A similar task is to find words consisting of two identical parts:

    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa

The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, I<etc>., and uses C<\g1> to look for
a repeat.  Although C<$1> and C<\g1> represent the same thing, care should be
taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp
and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing
so may lead to surprising and unsatisfactory results.


=head2 Relative backreferences

Counting the opening parentheses to get the correct number for a
backreference is error-prone as soon as there is more than one
capturing group.  A more convenient technique became available
with Perl 5.10: relative backreferences.  To refer to the immediately
preceding capture group, one may now write C<\g-1> or C<\g{-1}>; the one
before that is available via C<\g-2> or C<\g{-2}>, and so on.

Another good reason in addition to readability and maintainability
for using relative backreferences is illustrated by the following example,
where a simple pattern for matching peculiar strings is used:

    $a99a = '([a-z])(\d)\g2\g1';   # matches a11a, g22g, x33x, etc.

Now that we have this pattern stored as a handy string, we might feel
tempted to use it as a part of some other pattern:

    $line = "code=e99e";
    if ($line =~ /^(\w+)=$a99a$/){   # unexpected behavior!
        print "$1 is valid\n";
    } else {
        print "bad line: '$line'\n";
    }

But this doesn't match, at least not the way one might expect. Only
after inserting the interpolated C<$a99a> and looking at the resulting
full text of the regexp is it obvious that the backreferences have
backfired. The subexpression C<(\w+)> has snatched number 1 and
demoted the groups in C<$a99a> by one rank. This can be avoided by
using relative backreferences:

    $a99a = '([a-z])(\d)\g{-1}\g{-2}';  # safe for being interpolated
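
With the relative form, the earlier test should behave as intended.  The
sketch below simply reruns it with the same sample line:

    my $a99a = '([a-z])(\d)\g{-1}\g{-2}';    # relative backreferences
    my $line = "code=e99e";
    if ( $line =~ /^(\w+)=$a99a$/ ) {
        print "$1 is valid\n";               # prints code is valid
    } else {
        print "bad line: '$line'\n";
    }

Here C<\g{-1}> refers to C<(\d)> and C<\g{-2}> to C<([a-z])>, no matter
how many groups precede the interpolated string.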


=head2 Named backreferences

Perl 5.10 also introduced named capture groups and named backreferences.
To attach a name to a capturing group, you write either
C<< (?<name>...) >> or C<< (?'name'...) >>.  The backreference may
then be written as C<\g{name}>.  It is permissible to attach the
same name to more than one group, but then only the leftmost one of the
eponymous set can be referenced.  Outside of the pattern a named
capture group is accessible through the C<%+> hash.

Assuming that we have to match calendar dates which may be given in one
of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
three suitable patterns where we use C<'d'>, C<'m'> and C<'y'> respectively as the
names of the groups capturing the pertaining components of a date. The
matching operation combines the three patterns as alternatives:

    $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
    for my $d (qw(2006-10-21 15.01.2007 10/31/2005)) {
        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
            print "day=$+{d} month=$+{m} year=$+{y}\n";
        }
    }

If any of the alternatives matches, the hash C<%+> is bound to contain the
three key-value pairs.
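
Named groups can also be used as backreferences inside the pattern.  As a
minimal sketch, the doubled-word search could be written with a group
name (the name C<word> and the sample text are assumptions for
illustration):

    my $text = "hello hello world";          # hypothetical sample text
    if ( $text =~ /\b(?<word>\w+)\s+\g{word}\b/ ) {
        print "repeated: $+{word}\n";        # prints repeated: hello
    }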


=head2 Alternative capture group numbering

Yet another capturing group numbering technique (also available since Perl 5.10)
deals with the problem of referring to groups within a set of alternatives.
Consider a pattern for matching a time of the day, civil or military style:

    if ( $time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/ ){
        # process hour and minute
    }

Processing the results requires an additional if statement to determine
whether C<$1> and C<$2> or C<$3> and C<$4> contain the goodies. It would
be easier if we could use group numbers 1 and 2 in the second alternative as
well, and this is exactly what the parenthesized construct C<(?|...)>,
set around an alternative, achieves. Here is an extended version of the
previous pattern:

  if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){
      print "hour=$1 minute=$2 zone=$3\n";
  }

Within the alternative numbering group, group numbers start at the same
position for each alternative. After the group, numbering continues
with one higher than the maximum reached across all the alternatives.
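
The following sketch runs the extended pattern against one time in each
style; both sample strings are assumptions made for illustration:

    for my $time ( "9:30 CET", "0930 UTC" ) {    # hypothetical inputs
        if ( $time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/ ) {
            print "hour=$1 minute=$2 zone=$3\n";
        }
    }
    # prints:
    #   hour=9 minute=30 zone=CET
    #   hour=09 minute=30 zone=UTC

Whichever alternative matches, the hour and minute land in C<$1> and
C<$2>, and the zone is always in C<$3>.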

=head2 Position information

In addition to what was matched, Perl also provides the
positions of what was matched as contents of the C<@-> and C<@+>
arrays. C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
position of the start of the C<$n> match and C<$+[n]> is the position
of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
this code

    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches



    "hello" =~ /(hi|hello)/n; # $1 is not set!

See L<perlre/"n"> for more information.

=head2 Matching repetitions

The examples in the previous section display an annoying weakness.  We
were only matching 3-letter words, or chunks of words of 4 letters or
less.  We'd like to be able to match words or, more generally, strings
of any length, without writing out tedious alternatives like
C<\w\w\w\w|\w\w\w|\w\w|\w>.

This is exactly the problem the I<quantifier> metacharacters C<'?'>,
C<'*'>, C<'+'>, and C<{}> were created for.  They allow us to delimit the
number of repeats for a portion of a regexp we consider to be a
match.  Quantifiers are put immediately after the character, character
class, or grouping that we want to specify.  They have the following
meanings:

=over 4

=item *

C<a?> means: match C<'a'> 1 or 0 times

=item *

C<a*> means: match C<'a'> 0 or more times, I<i.e.>, any number of times

=item *

C<a+> means: match C<'a'> 1 or more times, I<i.e.>, at least once

=item *

C<a{n,m}> means: match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> means: match at least C<n> times

=item *

C<a{,n}> means: match at most C<n> times

=item *

C<a{n}> means: match exactly C<n> times

=back

If you like, you can add blanks (tab or space characters) within the
braces, but adjacent to them, and/or next to the comma (if any).

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least one space, and
                     # any number of digits
    /(\w+)\s+\g1/;    # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{ 2, 4 }$/;    # Same; for those who like wide open
                                # spaces.
    $year =~ /^\d{2, 4}$/;      # Same.
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates
    $year =~ /^\d{2}(\d{2})?$/; # same thing written differently.
                                # However, this captures the last two
                                # digits in $1 and the other does not.

    % simple_grep '^(\w+)\g1$' /usr/dict/words   # isn't this easier?
    beriberi
    booboo
    coco
    mama
    murmur
    papa
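
To see how the year checks above differ, this sketch runs two of them
over a few sample values (the values are assumptions chosen for
illustration):

    for my $year ( qw(07 207 2007 20070) ) {     # hypothetical inputs
        printf "%-5s  2-to-4: %-3s  2-or-4: %s\n", $year,
            ( $year =~ /^\d{2,4}$/       ? "yes" : "no" ),
            ( $year =~ /^\d{4}$|^\d{2}$/ ? "yes" : "no" );
    }

Only C<207> is treated differently by the two tests: it has between 2
and 4 digits, but not exactly 2 or exactly 4.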

For all of these quantifiers, Perl will try to match as much of the
string as possible, while still allowing the regexp to succeed.  Thus
with C</a?.../>, Perl will first try to match the regexp with the C<'a'>
present; if that fails, Perl will try to match the regexp without the
C<'a'> present.  For the quantifier C<'*'>, we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'

This is what we might expect: the match finds the only C<cat> in the
string and locks onto it.  Consider, however, this regexp:

    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 characters match)

One might initially guess that Perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>.  Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match.  In
this example, that means matching the C<at> of the regexp against the
final C<at> in the string.  The other important principle illustrated here is that,
when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps.  Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string.   Quantifiers that
grab as much of the string as possible are called I<maximal match> or
I<greedy> quantifiers.
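
A smaller sketch shows both effects at once; the string of four C<'a'>
characters is an assumption made for illustration:

    my $x = "aaaa";
    # the leftmost quantifier takes everything, the second gets nothing
    $x =~ /^(a*)(a*)$/ and print "'$1' '$2'\n";   # prints 'aaaa' ''
    # here the first must give one 'a' back so that a+ can match
    $x =~ /^(a*)(a+)$/ and print "'$1' '$2'\n";   # prints 'aaa' 'a'

In the second match the first quantifier is still greedy, but only as
far as the rest of the regexp allows.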

When a regexp can match a string in several different ways, we can use
the principles above to predict which way the regexp will match:

=over 4

=item *

Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.

=item *

Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
that allows a match for the whole regexp will be the one used.

=item *

Principle 2: The maximal matching quantifiers C<'?'>, C<'*'>, C<'+'> and
C<{n,m}> will in general match as much of the string as possible while


regexp.

The ability of an independent subexpression to prevent backtracking
can be quite useful.  Suppose we want to match a non-empty string
enclosed in parentheses up to two levels deep.  Then the following
regexp matches:

    $x = "abc(de(fg)h";  # unbalanced parentheses
    $x =~ /\( ( [ ^ () ]+ | \( [ ^ () ]* \) )+ \)/xx;

The regexp matches an open parenthesis, one or more copies of an
alternation, and a close parenthesis.  The alternation is two-way, with
the first alternative C<[^()]+> matching a substring with no
parentheses and the second alternative C<\([^()]*\)>  matching a
substring delimited by parentheses.  The problem with this regexp is
that it is pathological: it has nested indeterminate quantifiers
of the form C<(a+|b)+>.  We discussed in Part 1 how nested quantifiers
like this could take an exponentially long time to execute if there
is no match possible.  To prevent the exponential blowup, we need to
prevent useless backtracking at some point.  This can be done by
enclosing the inner quantifier as an independent subexpression:

    $x =~ /\( ( (?> [ ^ () ]+ ) | \([ ^ () ]* \) )+ \)/xx;

Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
by gobbling up as much of the string as possible and keeping it.   Then
match failures fail much more quickly.
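
The sketch below exercises the independent-subexpression version on one
string that matches and one that cannot.  Both strings are assumptions
made for illustration, and the pattern is written with plain C</x> (no
blanks inside the character classes) so that it does not depend on
C</xx>:

    my $atomic = qr/\( ( (?> [^()]+ ) | \( [^()]* \) )+ \)/x;

    my $balanced = "abc(de(fg)h)";                # hypothetical input
    print "matched: $&\n" if $balanced =~ $atomic;
    # prints matched: (de(fg)h)

    my $hopeless = "(" . "a" x 40;                # no closing parenthesis
    print "no match\n" unless $hopeless =~ $atomic;
    # prints no match

On the second string the independent subexpression takes all 40
characters in one piece and never gives any back, so the failure is
reported without trying the many ways of splitting them up.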


=head2 Conditional expressions

A I<conditional expression> is a form of if-then-else statement
that allows one to choose which patterns are to be matched, based on
some condition.  There are two types of conditional expression:
C<(?(I<condition>)I<yes-regexp>)> and
C<(?(I<condition>)I<yes-regexp>|I<no-regexp>)>.
C<(?(I<condition>)I<yes-regexp>)> is
like an S<C<'if () {}'>> statement in Perl.  If the I<condition> is true,
the I<yes-regexp> will be matched.  If the I<condition> is false, the
I<yes-regexp> will be skipped and Perl will move on to the next regexp
element.  The second form is like an S<C<'if () {} else {}'>> statement
in Perl.  If the I<condition> is true, the I<yes-regexp> will be
matched, otherwise the I<no-regexp> will be matched.

The I<condition> can have several forms.  The first form is simply an
integer in parentheses C<(I<integer>)>.  It is true if the corresponding
backreference C<\I<integer>> matched earlier in the regexp.  The same
thing can be done with a name associated with a capture group, written
as C<<< (E<lt>I<name>E<gt>) >>> or C<< ('I<name>') >>.  The second form is a bare
zero-width assertion C<(?...)>, either a lookahead, a lookbehind, or a
code assertion (discussed in the next section).  The third set of forms
provides tests that return true if the expression is executed within
a recursion (C<(R)>) or is being called from some capturing group,
referenced either by number (C<(R1)>, C<(R2)>,...) or by name
(C<(R&I<name>)>).

The integer or name form of the C<condition> allows us to choose,
with more flexibility, what to match based on what matched earlier in the
regexp. This searches for words of the form C<"$x$x"> or C<"$x$y$y$x">:

    % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words
    beriberi
    coco
    couscous
    deed
    ...
    toot
    toto
    tutu
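
A smaller illustration of the integer form, offered as a sketch: require
a closing parenthesis only if an opening one was matched.  The pattern
and the sample strings are assumptions made for illustration:

    for my $s ( "(word)", "word", "(word", "word)" ) {
        # group 1 is the optional '('; (?(1)\)) demands ')' only if it matched
        printf "%-7s %s\n", $s, ( $s =~ /^(\()?\w+(?(1)\))$/ ? "ok" : "no" );
    }
    # prints:
    #   (word)  ok
    #   word    ok
    #   (word   no
    #   word)   no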

The lookbehind C<condition> allows, along with backreferences,
an earlier part of the match to influence a later part of the
match.  For instance,

    /[ATGC]+(?(?<=AA)G|C)$/;

matches a DNA sequence such that it either ends in C<AAG>, or some
other base pair combination and C<'C'>.  Note that the form is
C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
lookahead, lookbehind or code assertions, the parentheses around the
conditional are not needed.
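
To try the lookbehind condition out, this sketch runs the regexp over a
few short sequences (all four are assumptions made for illustration):

    for my $dna ( qw(GATTAAG GATTAAC GATTC GATTG) ) {
        my $verdict = $dna =~ /[ATGC]+(?(?<=AA)G|C)$/
                    ? "matches" : "does not match";
        print "$dna $verdict\n";
    }
    # prints:
    #   GATTAAG matches
    #   GATTAAC does not match
    #   GATTC matches
    #   GATTG does not match

C<GATTAAC> fails because the final C<'C'> is preceded by C<AA>, and in
that case the condition demands a C<'G'>.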


=head2 Defining named patterns

Some regular expressions use identical subpatterns in several places.
Starting with Perl 5.10, it is possible to define named subpatterns in
a section of the pattern so that they can be called up by name
anywhere in the pattern.  The syntax for this definition
group is C<< (?(DEFINE)(?<I<name>>I<pattern>)...) >>.  An insertion
of a named pattern is written as C<(?&I<name>)>.

The example below illustrates this feature using the pattern for
floating point numbers that was presented earlier on.  The three
subpatterns that are used more than once are the optional sign, the
digit sequence for an integer and the decimal fraction.  The C<DEFINE>
group at the end of the pattern contains their definition.  Notice
that the decimal fraction pattern is the first place where we can
reuse the integer pattern.

   /^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
      (?: [eE](?&osg)(?&int) )?
    $
    (?(DEFINE)
      (?<osg>[-+]?)         # optional sign
      (?<int>\d++)          # integer
      (?<dec>\.(?&int))     # decimal fraction
    )/x
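
As a sketch of how this pattern might be used, it can be stored with
C<qr//> and applied to a few candidates.  The variable name C<$float>
and the sample strings are assumptions made for illustration:

    my $float = qr/^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
                     (?: [eE](?&osg)(?&int) )?
                   $
                   (?(DEFINE)
                     (?<osg>[-+]?)         # optional sign
                     (?<int>\d++)          # integer
                     (?<dec>\.(?&int))     # decimal fraction
                   )/x;

    for my $n ( "3.14", "-.5e10", "1e-3", "abc", "1." ) {
        printf "%-7s %s\n", $n, ( $n =~ $float ? "number" : "not a number" );
    }

The first three strings are accepted.  C<"abc"> has no digits, and
C<"1."> is rejected because the decimal fraction needs at least one
digit after the point.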


=head2 Recursive patterns

This feature (introduced in Perl 5.10) significantly extends the
power of Perl's pattern matching.  By referring to some other
capture group anywhere in the pattern with the construct
C<(?I<group-ref>)>, the I<pattern> within the referenced group is used
as an independent subpattern in place of the group reference itself.
Because the group reference may be contained I<within> the group it
refers to, it is now possible to apply pattern matching to tasks that
hitherto required a recursive parser.

To illustrate this feature, we'll design a pattern that matches if
a string contains a palindrome. (This is a word or a sentence that,
ignoring spaces, punctuation and case, reads the same backwards
as forwards.) We begin by observing that the empty string or a string
containing just one word character is a palindrome. Otherwise it must
have a word character up front and the same at its end, with another
palindrome in between.

 /(?: (\w) (?...Here be a palindrome...) \g{ -1 } | \w? )/x

Adding C<\W*> at either end to eliminate what is to be ignored, we already
have the full pattern:

    my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;
    for my $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){
        print "'$s' is a palindrome\n" if $s =~ /$pp/;
    }

In C<(?...)> both absolute and relative backreferences may be used.
The entire pattern can be reinserted with C<(?R)> or C<(?0)>.
If you prefer to name your groups, you can use C<(?&I<name>)> to
recurse into that group.
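
As a sketch of recursion by name, the following pattern checks that
parentheses are balanced to any depth.  The group name C<paren> and the
sample strings are assumptions made for illustration:

    my $balanced = qr/^ (?<paren> \( (?: [^()]++ | (?&paren) )* \) ) $/x;

    for my $s ( "(a(b)c)", "(a(b c)", "()" ) {
        printf "%-8s %s\n", $s,
            ( $s =~ $balanced ? "balanced" : "not balanced" );
    }
    # prints:
    #   (a(b)c)  balanced
    #   (a(b c)  not balanced
    #   ()       balanced

Each time C<(?&paren)> is reached, the whole C<paren> group is matched
again at the current position, which is what handles the nesting.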


=head2 A bit of magic: executing Perl code in a regular expression

Normally, regexps are a part of Perl expressions.
I<Code evaluation> expressions turn that around by allowing
arbitrary Perl code to be a part of a regexp.  A code evaluation
expression is denoted C<(?{I<code>})>, with I<code> a string of Perl
statements.

Code expressions are zero-width assertions, and the value they return
depends on their environment.  There are two possibilities: either the
code expression is used as a conditional in a conditional expression
C<(?(I<condition>)...)>, or it is not.  If the code expression is a
conditional, the code is evaluated and the result (I<i.e.>, the result of
the last statement) is used to determine truth or falsehood.  If the
code expression is not used as a conditional, the assertion always
evaluates true and the result is put into the special variable
C<$^R>.  The variable C<$^R> can then be used in code expressions later
in the regexp.  Here are some silly examples:

    $x = "abcdef";
    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
                                         # prints 'Hi Mom!'
    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
                                         # no 'Hi Mom!'
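
The role of C<$^R> can be sketched in the same spirit.  In this
illustrative example (the arithmetic is an arbitrary assumption), the
first code expression leaves its result in C<$^R>, and a second code
expression later in the same regexp reads it:

    $x = "abcdef";
    $x =~ /abc(?{ 6 * 7 })def(?{ print "saw ", $^R, "\n" })/;
    # matches, prints 'saw 42'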

Pay careful attention to the next example:

    $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
                                         # no 'Hi Mom!'
                                         # but why not?

At first glance, you'd think that it shouldn't print, because obviously
the C<ddd> isn't going to match the target string. But look at this
example:

    $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn't match,
                                            # but _does_ print

Hmm. What happened here? If you've been following along, you know that
the above pattern should be effectively (almost) the same as the last one;
enclosing the C<'d'> in a character class isn't going to change what it
matches. So why does the first not print while the second one does?

The answer lies in the optimizations the regexp engine makes. In the first
case, all the engine sees are plain old characters (aside from the
C<?{}> construct). It's smart enough to realize that the string C<'ddd'>
doesn't occur in our target string before actually running the pattern
through. But in the second case, we've tricked it into thinking that our
pattern is more complicated. It takes a look, sees our
character class, and decides that it will have to actually run the


