the next position in the string. Some examples:
    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.
    "cats" =~ /c|ca|cat|cats/;  # matches "c"
    "cats" =~ /cats|cat|ca|c/;  # matches "cats"
At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.
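To check which alternative wins, a short script can print the text each regex actually matched. This is only a sketch: the helper name C<matched_text> is made up for illustration, and it relies on the special variable C<$&>, which holds the text of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $regex in $string,
# or undef if there is no match.  $& is the last successful match.
sub matched_text {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $& : undef;
}

print matched_text("cats and dogs", qr/dog|cat|bird/), "\n";  # cat
print matched_text("cats", qr/c|ca|cat|cats/), "\n";          # c
print matched_text("cats", qr/cats|cat|ca|c/), "\n";          # cats
```

The position in the string is decided first; the order of the alternatives only matters among those that can match at that position.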
=head2 Grouping things and hierarchical matching
The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit. Parts of a regex are grouped by enclosing
them in parentheses. The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are
    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.
    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match
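The nested grouping and the empty alternative can be tried out with a small script. This is a sketch under two assumptions: the word list is made up, and the anchors C<^> and C<$> are added so that the whole word has to match rather than just a prefix of it.

```perl
use strict;
use warnings;

# ^ and $ are added here so the whole word must match, not just a prefix.
my $regex = qr/^house(cat(s|)|)$/;
for my $word (qw(house housecat housecats houseboat)) {
    my $verdict = $word =~ $regex ? "matches" : "does not match";
    print "$word $verdict\n";
}

# The empty alternative lets the group match nothing at all.
if ("20" =~ /(19|20|)\d\d/) {
    print "group 1 captured '$1'\n";   # group 1 captured ''
}
```

Only C<houseboat> is rejected: after C<house>, neither C<cat> nor the empty alternative lets the match reach the end of the word.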
=head2 Extracting matches
The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched. For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:
    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;
In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>. So we could rewrite it as
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
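The matching variables are only meaningful after a successful match, so it is usually worth testing the result before using them. A minimal sketch, assuming a helper named C<parse_time> (not part of Perl) and made-up sample strings:

```perl
use strict;
use warnings;

# Hypothetical helper: in list context the match returns ($1, $2, $3)
# on success and the empty list on failure.
sub parse_time {
    my ($time) = @_;
    return $time =~ /^(\d\d):(\d\d):(\d\d)$/;
}

for my $time ("08:45:07", "8:45") {
    if (my ($hours, $minutes, $seconds) = parse_time($time)) {
        print "$time: hours=$hours minutes=$minutes seconds=$seconds\n";
    }
    else {
        print "$time: not in hh:mm:ss format\n";
    }
}
```

The list assignment in the C<if> condition is true only when the match returned some values, so the second string takes the C<else> branch.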
If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regex and the matching variables
indicated below it:
    /(ab(cd|ef)((gi)|j))/;
     1  2      34
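Running that regex against a sample string shows how the numbering follows the opening parentheses. The input C<abcdgi> is an assumption chosen so that every group takes part in the match:

```perl
use strict;
use warnings;

my $string = "abcdgi";   # sample input, assumed for illustration
my @groups = $string =~ /(ab(cd|ef)((gi)|j))/;

# Print each group next to its number.
for my $i (0 .. $#groups) {
    my $number = $i + 1;
    print "\$$number = $groups[$i]\n";
}
# $1 = abcdgi
# $2 = cd
# $3 = gi
# $4 = gi
```

With an input such as C<abefj>, group 4 would not take part in the match and C<$4> would be undefined.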
Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are
matching variables that can be used I<inside> a regex:
    /(\w\w\w)\s\g1/;  # find sequences like 'the the' in string
C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
C<\g2>, ... only inside a regex.
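The division of labour between C<\g1> and C<$1> can be seen in one short script. This is a sketch with an assumed sample sentence; the word anchors C<\b> are added so that only whole words are compared.

```perl
use strict;
use warnings;

my $text = "Paris in the the spring";   # sample text, assumed

# Inside the regex, \g1 refers back to whatever group 1 matched.
if ($text =~ /\b(\w+)\s+\g1\b/) {
    # Outside the regex, the same group is read back as $1.
    print "Doubled word: $1\n";   # Doubled word: the
}
```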
=head2 Matching repetitions
The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> let us
specify how many times a portion of a regex may repeat and still count
as a match. A quantifier is placed immediately after the character,
character class, or grouping that it applies to. They have the
following meanings:
=over 4
=item *
C<a?> = match 'a' 1 or 0 times
=item *
C<a*> = match 'a' 0 or more times, i.e., any number of times
=item *
C<a+> = match 'a' 1 or more times, i.e., at least once
=item *
C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.
=item *
C<a{n,}> = match C<n> or more times
=item *
C<a{,n}> = match C<n> times or fewer (Added in v5.34)
=item *
C<a{n}> = match exactly C<n> times
=back
Here are some examples:
    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{ 4 }$|^\d{2}$/;  # better match; throw out 3 digit dates
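The difference between the two year checks shows up when both are run over a few values. A sketch with made-up inputs; the variable names C<$loose> and C<$strict> are illustrative, and C<\d{4}> is written without inner spaces so the script also behaves the same on Perls older than v5.34.

```perl
use strict;
use warnings;

my $loose  = qr/^\d{2,4}$/;        # 2, 3, or 4 digits
my $strict = qr/^\d{4}$|^\d{2}$/;  # exactly 4 or exactly 2 digits

for my $year (qw(07 123 1999 20240)) {
    my $is_loose  = $year =~ $loose  ? "yes" : "no";
    my $is_strict = $year =~ $strict ? "yes" : "no";
    print "$year: 2 to 4 digits? $is_loose; exactly 2 or 4? $is_strict\n";
}
```

Only C<123> is treated differently by the two patterns: the first accepts it and the second rejects it.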
These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match. So we have
    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/;  # matches,
                             # $1 = 'the cat in the h'
                             # $2 = 'at'
                             # $3 = ''   (0 matches)
The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.
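The phrase "while still allowing the regex to match" matters: a greedy quantifier gives characters back when the rest of the regex needs them. The sketch below repeats the match and then, as an assumed variation, changes the last group to C<(.+)> so that it must match at least one character.

```perl
use strict;
use warnings;

my $x = 'the cat in the hat';

# Greedy: the first .* takes everything up to the last 'at'.
my @greedy = $x =~ /^(.*)(at)(.*)$/;
print join('|', @greedy), "\n";   # the cat in the h|at|

# Requiring at least one trailing character forces the first .*
# to back off to the earlier 'at'.
my @forced = $x =~ /^(.*)(at)(.+)$/;
print join('|', @forced), "\n";   # the c|at| in the hat
```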
=head2 More matching
There are a few more things you might want to know about matching
operators.
The global modifier C</g> allows the matching operator to match
within a string as many times as possible. In scalar context,
successive matches against a string will have C</g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,
    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }
prints
    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13
A failed match or changing the target string resets the position. If
you don't want the position reset after a failed match, add the
C</c> modifier, as in C</regex/gc>.
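The effect of C</c> can be observed by checking C<pos()> after a failed match. A minimal sketch with an assumed sample string; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $x = "cat dog";   # sample string, assumed
my $found;

$found = $x =~ /cat/g;       # succeeds; position is now 3
$found = $x =~ /bird/g;      # fails; position is reset
my $after_plain = pos $x;    # undef

$found = $x =~ /cat/g;       # succeeds again from the start; position 3
$found = $x =~ /bird/gc;     # fails, but /c keeps the position
my $after_c = pos $x;        # 3

print "without /c: ", (defined $after_plain ? $after_plain : "undef"), "\n";
print "with /c: $after_c\n";
```

Keeping the position after a failure is useful when several different regexes are tried in turn at the same place in a string.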
In list context, C</g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex. So
    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'
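When the regex has more than one grouping, list context returns the groups from every match in order, which makes it convenient to fill a hash. A sketch under stated assumptions: the C<name=value> input format and the variable names are invented for illustration.

```perl
use strict;
use warnings;

my $config = "width=80 height=24 depth=8";   # sample input, assumed

# Two groupings per match: the list is (name, value, name, value, ...).
my %setting = $config =~ /(\w+)=(\d+)/g;
for my $name (sort keys %setting) {
    print "$name is $setting{$name}\n";
}

# No groupings: the list holds each match of the whole regex.
my @numbers = $config =~ /\d+/g;
print "@numbers\n";   # 80 24 8
```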
=head2 Search and replace
Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regex>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false. Here are a few examples:
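The sketch below, with sample strings chosen for illustration, shows a simple substitution, a replacement that reuses C<$1>, and the return value of C<s///>. The global modifier C</g> is applied here to a substitution so that the count can exceed one.

```perl
use strict;
use warnings;

my $x = "Time to feed the cat!";
my $count = ($x =~ s/cat/dog/);      # $x is now "Time to feed the dog!"
print "$x ($count substitution)\n";

my $y = "'quoted words'";
$y =~ s/^'(.*)'$/$1/;                # strip the single quotes
print "$y\n";                        # quoted words

my $z = "a-b-c";
my $all = ($z =~ s/-/+/g);           # /g replaces every occurrence
print "$z ($all substitutions)\n";   # a+b+c (2 substitutions)

my $w = "nothing to see";
print "no match\n" unless $w =~ s/cat/dog/;   # s/// returned false
```

Because C<s///> modifies its target in place, it has to be applied to a variable rather than to a string literal.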