pod/perlre.pod view on Meta::CPAN
{n,m}+ Match at least n but not more than m times and give nothing back
For instance,
'aaaa' =~ /a++a/
will never match, as the C<a++> will gobble up all the C<"a">'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:
/"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression
C<L</(?E<gt>I<pattern>)>> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
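As a quick check of the claims above, here is a minimal sketch that compares greedy and possessive matching, and confirms that the possessive and atomic spellings of the double-quoted-string pattern find the same substring. The variable names and the sample text are illustrative assumptions, not part of the original documentation.

```perl
use strict;
use warnings;

# Greedy a+ gives back one "a" so the trailing "a" can match.
my $greedy = ('aaaa' =~ /a+a/) ? 1 : 0;

# Possessive a++ keeps all four, leaving nothing for the trailing "a".
my $possessive = ('aaaa' =~ /a++a/) ? 1 : 0;

# The two spellings of the double-quoted-string matcher shown above.
my $dq_possessive = qr/"(?:[^"\\]++|\\.)*+"/;
my $dq_atomic     = qr/"(?>(?:(?>[^"\\]+)|\\.)*)"/;

# Sample input (assumed): a quoted string containing escaped quotes.
my $text = 'say "he said \"hi\"" now';
my ($m1) = $text =~ /($dq_possessive)/;
my ($m2) = $text =~ /($dq_atomic)/;

print "greedy=$greedy possessive=$possessive\n";
print "$m1\n";
```

Both patterns stop at the first unescaped closing quote, and neither retries shorter alternatives if that quote is missing.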
Note that the possessive quantifier modifier cannot be combined
with the non-greedy modifier. This is because it would make no sense.
Consider the following equivalency table:
    Illegal         Legal
    ------------    ------
    X??+            X{0}
    X+?+            X{1}
    X{min,max}?+    X{min}
=head3 Escape sequences
Because patterns are processed as double-quoted strings, the following
also work:
    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \cK         control char          (example: VT)
    \x{}, \x00  character whose ordinal is the given hexadecimal number
    \N{name}    named Unicode character or character sequence
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \o{}, \000  character whose ordinal is the given octal number
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase until \E (think vi)
    \U          uppercase until \E (think vi)
    \Q          quote (disable) pattern metacharacters until \E
    \E          end either case modification or quoted section, think vi
Details are in L<perlop/Quote and Quote-like Operators>.
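A minimal sketch of a few of these escapes in use follows. The sample strings and variable names are assumptions chosen for illustration; C<\o{}> needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $tab_ok = ("a\tb" =~ /a\tb/)    ? 1 : 0;    # \t matches a literal tab
my $hex_ok = ('A'    =~ /\x41/)    ? 1 : 0;    # \x41 is "A"
my $oct_ok = ('A'    =~ /\o{101}/) ? 1 : 0;    # \o{101} is also "A"

# A string that happens to contain pattern metacharacters.
my $literal = 'a.b*c';

# Interpolated as-is, "." and "*" keep their special meaning.
my $unquoted = ('aXbbc' =~ /^$literal$/) ? 1 : 0;

# Between \Q and \E the metacharacters are matched literally.
my $quoted = ('aXbbc' =~ /^\Q$literal\E$/) ? 1 : 0;
my $exact  = ('a.b*c' =~ /^\Q$literal\E$/) ? 1 : 0;

print "$tab_ok $hex_ok $oct_ok $unquoted $quoted $exact\n";
```

Wrapping interpolated user input in C<\Q...\E> is the usual way to keep it from being interpreted as pattern syntax.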
=head3 Character Classes and other Special Escapes
In addition, Perl defines the following:
X<\g> X<\k> X<\K> X<backreference>
    Sequence   Note  Description
    [...]      [1]   Match a character according to the rules of the
                       bracketed character class defined by the "...".
                       Example: [a-z] matches "a" or "b" or "c" ... or "z"
    [[:...:]]  [2]   Match a character according to the rules of the POSIX
                       character class "..." within the outer bracketed
                       character class.  Example: [[:upper:]] matches any
                       uppercase character.
    (?[...])   [8]   Extended bracketed character class
    \w         [3]   Match a "word" character (alphanumeric plus "_", plus
                       other connector punctuation chars plus Unicode
                       marks)
    \W         [3]   Match a non-"word" character
    \s         [3]   Match a whitespace character
    \S         [3]   Match a non-whitespace character
    \d         [3]   Match a decimal digit character
    \D         [3]   Match a non-digit character
    \pP        [3]   Match P, named property.  Use \p{Prop} for longer names
    \PP        [3]   Match non-P
    \X         [4]   Match Unicode "eXtended grapheme cluster"
    \1         [5]   Backreference to a specific capture group or buffer.
                       '1' may actually be any positive integer.
    \g1        [5]   Backreference to a specific or previous group,
    \g{-1}     [5]   The number may be negative indicating a relative
                       previous group and may optionally be wrapped in
                       curly brackets for safer parsing.
    \g{name}   [5]   Named backreference
    \k<name>   [5]   Named backreference
    \k'name'   [5]   Named backreference
    \k{name}   [5]   Named backreference
    \K         [6]   Keep the stuff left of the \K, don't include it in $&
    \N         [7]   Any character but \n.  Not affected by /s modifier
    \v         [3]   Vertical whitespace
    \V         [3]   Not vertical whitespace
    \h         [3]   Horizontal whitespace
    \H         [3]   Not horizontal whitespace
    \R         [4]   Linebreak
=over 4
=item [1]
See L<perlrecharclass/Bracketed Character Classes> for details.
=item [2]
See L<perlrecharclass/POSIX Character Classes> for details.
=item [3]
See L<perlunicode/Unicode Character Properties> for details.
=item [4]
See L<perlrebackslash/Misc> for details.
=item [5]
See L</Capture groups> below for details.
=item [6]
See L</Extended Patterns> below for details.
=item [7]
Note that C<\N> has two meanings. When of the form C<\N{I<NAME>}>, it
matches the character or character sequence whose name is I<NAME>; and
similarly
when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
code point is I<hex>. Otherwise it matches any character but C<\n>.
=item [8]
See L<perlrecharclass/Extended Bracketed Character Classes> for details.
=back
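The following sketch exercises four of the escapes from the table: C<\K>, C<\R>, C<\h> and C<\N>. The input strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# \K: keep "price=" out of the matched text, so only the digits are replaced.
(my $line = 'price=100') =~ s/price=\K\d+/250/;

# \R matches any linebreak sequence, including the two-character "\r\n".
my @rows = split /\R/, "one\r\ntwo\nthree";

# \h matches horizontal whitespace only, so it stops before the newline.
my @horizontal = "a \t\nb" =~ /(\h+)/;

# \N matches any character except "\n", even under /s.
my ($first_line) = "top\nbottom" =~ /^(\N+)/s;

print "$line\n";
print join('|', @rows), "\n";
print "$first_line\n";
```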
=head3 Assertions
Besides L<C<"^"> and C<"$">|/Metacharacters>, Perl defines the following
zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
A Unicode boundary (C<\b{}>), available starting in v5.22, is a spot
between two characters, or before the first character in the string, or
after the final character in the string where certain criteria defined
by Unicode are met. See L<perlrebackslash/\b{}, \b, \B{}, \B> for
details.
A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
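A small sketch of the word-boundary rule, using an assumed sample string: C<\bcat\b> skips the C<cat> inside C<concat> but accepts the one followed by a full stop, and C<[\b]> matches a backspace character.

```perl
use strict;
use warnings;

my $text = 'cat concat cat.';

# Count whole-word occurrences: "concat" has \w on both sides of "cat".
my $whole_words = () = $text =~ /\bcat\b/g;

# Without the assertions every occurrence is counted.
my $any = () = $text =~ /cat/g;

# Inside a bracketed character class, \b means backspace (0x08).
my $backspace = ("\x08" =~ /[\b]/) ? 1 : 0;

print "$whole_words $any $backspace\n";
```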
The C<\A> and C<\Z> are just like C<"^"> and C<"$">, except that they
won't match multiple times when the C</m> modifier is used, while
C<"^"> and C<"$"> will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z> X</m>
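To make the difference between these anchors concrete, here is a minimal sketch on an assumed two-line string with a trailing newline:

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# Under /m, "^" matches at the start of every line ...
my $caret_hits = () = $text =~ /^\w+/mg;

# ... while \A still matches only at the very start of the string.
my $big_a_hits = () = $text =~ /\A\w+/mg;

# \Z tolerates one trailing newline; \z does not.
my $upper_z   = ($text =~ /two\Z/)   ? 1 : 0;
my $lower_z   = ($text =~ /two\z/)   ? 1 : 0;
my $lower_znl = ($text =~ /two\n\z/) ? 1 : 0;

print "$caret_hits $big_a_hits $upper_z $lower_z $lower_znl\n";
```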
The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
matches (see L</"Repeated Patterns Matching a Zero-length Substring">)
is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match. Thus the following
will not match forever:
X<\G>
my $string = 'ABC';
pos($string) = 1;
while ($string =~ /(.\G)/g) {
print $1;
}
It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.
It is worth noting that C<\G> improperly used can result in an infinite
loop. Take care when using patterns that include C<\G> in an alternation.
Note also that C<s///> will refuse to overwrite part of a substitution
that has already been replaced; so for example this will stop after the
first iteration, rather than iterating its way backwards through the
string:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
print; # prints 1234X6789, not XXXXX6789
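As an illustration of the C<lex>-like use of C<\G> mentioned above, here is a minimal scanner sketch. The token names and the tiny input language are assumptions made up for this example; the C</gc> flags keep C<pos()> in place when an alternative fails, so the next pattern is tried at the same spot.

```perl
use strict;
use warnings;

my $src = 'x = 42';
my @tokens;

pos($src) = 0;    # start scanning at the beginning
while (pos($src) < length($src)) {
    if    ($src =~ /\G\s+/gc)            { next }    # skip whitespace
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "IDENT:$1" }
    elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G(=)/gc)            { push @tokens, "OP:$1" }
    else  { die "unexpected character at offset " . pos($src) . "\n" }
}

print join(' ', @tokens), "\n";
```

Each successful branch consumes at least one character, so the loop always advances; the final C<else> guards against input that no pattern recognises.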
=head3 Capture groups
The grouping construct C<( ... )> creates capture groups (also referred to as
capture buffers). To refer to the current contents of a group later on, within
the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>)
for the second, and so on.
This is called a I<backreference>.
X<regex, capture buffer> X<regexp, capture buffer>
X<regex, capture group> X<regexp, capture group>
X<regular expression, capture buffer> X<backreference>
X<regular expression, capture group> X<backreference>
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
X<named capture buffer> X<regular expression, named capture buffer>
X<named capture group> X<regular expression, named capture group>
X<%+> X<$+{name}> X<< \k<name> >>
There is no limit to the number of captured substrings that you may use.
Groups are numbered with the leftmost open parenthesis being number 1, I<etc>. If
a group did not match, the associated backreference won't match either. (This
can happen if the group is optional, or in a different branch of an
alternation.)
You can omit the C<"g">, and write C<"\1">, I<etc>, but there are some issues with
this form, described below.
You can also refer to capture groups relatively, by using a negative number, so
that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture
group, and C<\g-2> and C<\g{-2}> both refer to the group before it. For
example:
/
(Y) # group 1
( # group 2
(X) # group 3
\g{-1} # backref to group 3
\g{-3} # backref to group 1
)
/x
would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to
interpolate regexes into larger regexes and not have to worry about the
capture groups being renumbered.
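The renumbering problem can be seen in a short sketch. The two C<qr//> fragments and the test strings below are assumptions for illustration: the same "doubled character" fragment is written once with a relative and once with an absolute backreference, and then embedded after two other groups.

```perl
use strict;
use warnings;

# A reusable fragment: one word character followed by itself.
my $doubled_rel = qr/(\w)\g{-1}/;    # relative: "the group just before me"
my $doubled_abs = qr/(\w)\g{1}/;     # absolute: always group 1

# Embedded after two groups, the fragment's own group becomes number 3.
my $rel_cc = ('abcc' =~ /(a)(b)$doubled_rel/) ? 1 : 0;    # c followed by c
my $abs_cc = ('abcc' =~ /(a)(b)$doubled_abs/) ? 1 : 0;    # now wants "a"
my $abs_ca = ('abca' =~ /(a)(b)$doubled_abs/) ? 1 : 0;    # c followed by a

print "$rel_cc $abs_cc $abs_ca\n";
```

The absolute form silently changes meaning once it is interpolated, while the relative form keeps referring to the fragment's own group.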
You can dispense with numbers altogether and create named capture groups.
The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to
reference. (To be compatible with .Net regular expressions, C<\g{I<name>}> may
also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.)
I<name> must not begin with a number, nor contain hyphens.
When different groups within the same pattern have the same name, any reference
to that name assumes the leftmost defined group. Named groups count in
absolute and relative numbering, and so can also be referred to by those
numbers.
(It's possible to do things with named capture groups that would otherwise
require C<(??{})>.)
Capture group contents are dynamically scoped and available to you outside the
pattern until the end of the enclosing block or until the next successful
match in the same scope, whichever comes first.
See L<perlsyn/"Compound Statements"> and
L<perlvar/"Scoping Rules of Regex Variables"> for more details.
You can access the contents of a capture group by absolute number (using
C<"$1"> instead of C<"\g1">, I<etc>); or by name via the C<%+> hash,
using C<"$+{I<name>}">.
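A minimal sketch of both access styles, on an assumed date string, shows that a named group is still reachable by its number:

```perl
use strict;
use warnings;

my $date = '2024-01-15';
my ($year, $month, $day);

if ($date =~ /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/) {
    $year  = $+{year};    # by name, via the %+ hash
    $month = $2;          # by absolute number; same group as $+{month}
    $day   = $+{day};
}

print "$year/$month/$day\n";
```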
Braces are required in referring to named capture groups, but are optional for
absolute or relative numbered ones. Braces are safer when creating a regex by
concatenating smaller strings. For example if you have C<qr/$x$y/>, and C<$x>
contained C<"\g1">, and C<$y> contained C<"37">, you would get C</\g137/> which
is probably not what you intended.
If you use braces, you may also optionally add any number of blank
(space or tab) characters within but adjacent to the braces, like
S<C<\g{ -1 }>>, or S<C<\k{ I<name> }>>.
The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that
there were neither named nor relative numbered capture groups. Absolute numbered
groups were referred to using C<\1>,
C<\2>, I<etc>., and this notation is still
accepted (and likely always will be). But it leads to some ambiguities if
there are more than 9 capture groups, as C<\10> could mean either the tenth
capture group, or the character whose ordinal in octal is 010 (a backspace in
ASCII). Perl resolves this ambiguity by interpreting C<\10> as a backreference
only if at least 10 left parentheses have opened before it. Likewise C<\11> is
a backreference only if at least 11 left parentheses have opened before it.
And so on. C<\1> through C<\9> are always interpreted as backreferences.
There are several examples below that illustrate these perils. You can avoid
the ambiguity by always using C<\g{}> or C<\g> if you mean capturing groups;
and for octal constants always using C<\o{}>, or for C<\077> and below, using 3
digits padded with leading zeros, since a leading zero implies an octal
constant.
The C<\I<digit>> notation also works in certain circumstances outside
the pattern. See L</Warning on \1 Instead of $1> below for details.
Examples:
s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words
/(.)\g1/ # find first doubled char
and print "'$1' is the first doubled character\n";
/(?<char>.)\k<char>/ # ... a different way
and print "'$+{char}' is the first doubled character\n";
/(?'char'.)\g1/ # ... mix and match
and print "'$1' is the first doubled character\n";
if (/Time: (..):(..):(..)/) { # parse out values
$hours = $1;
$minutes = $2;
$seconds = $3;
}
/(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/ # \g10 is a backreference
/(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/ # \10 is octal
/((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/ # \10 is a backreference
/((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal
$x = '(.)\1'; # Creates problems when concatenated.
$y = '(.)\g{1}'; # Avoids the problems.
"aa" =~ /${x}/; # True
"aa" =~ /${y}/; # True
"aa0" =~ /${x}0/; # False!
"aa0" =~ /${y}0/; # True
"aa\x08" =~ /${x}0/; # True!
"aa\x08" =~ /${y}0/; # False
Several special variables also refer back to portions of the previous
match. C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string. (At one point C<$0> did
also, but now it returns the name of the program.) C<$`> returns
everything before the matched string. C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>
These special variables, like the C<%+> hash and the numbered match variables
(C<$1>, C<$2>, C<$3>, I<etc>.) are dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
X<@{^CAPTURE}>
The C<@{^CAPTURE}> array may be used to access ALL of the capture buffers
as an array without needing to know how many there are. For instance
$string=~/$pattern/ and @captured = @{^CAPTURE};
will place a copy of each capture variable, C<$1>, C<$2> etc, into the
C<@captured> array.
Be aware that when interpolating a subscript of the C<@{^CAPTURE}>
array you must use demarcated curly brace notation:
print "${^CAPTURE[0]}";
See L<perldata/"Demarcated variable names using braces"> for more on
this notation.
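A short sketch of copying all captures at once, assuming a Perl recent enough to provide C<@{^CAPTURE}> (5.26 or later, to the best of my knowledge) and an illustrative input string:

```perl
use strict;
use warnings;

my @captured;

# Element 0 of @{^CAPTURE} corresponds to $1, element 1 to $2, and so on.
'2024-01-15' =~ /(\d+)-(\d+)-(\d+)/ and @captured = @{^CAPTURE};

print scalar(@captured), " groups: @captured\n";
```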
B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
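The following sketch, with made-up input strings, shows the match variables surviving a later failed match in the same scope:

```perl
use strict;
use warnings;

# First match succeeds and sets $1 and $2.
my $first_ok = ('key=value' =~ /(\w+)=(\w+)/) ? 1 : 0;

# Second match fails; $1 and $2 are left untouched.
my $second_ok = ('no equals sign' =~ /(\w+)=(\w+)/) ? 1 : 0;

my ($k, $v) = ($1, $2);
print "$first_ok $second_ok $k $v\n";
```

This is also why C<$1> should only be trusted after checking that the match you care about actually succeeded.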
B<WARNING>: If your code is to run on Perl 5.16 or earlier,
beware that once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program.
Perl uses the same mechanism to produce C<$1>, C<$2>, I<etc>, so you also
pay a price for each pattern that contains capturing parentheses.
(To avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.) But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price.
=over 4
=item C<(?#I<text>)>
X<(?#)>
A comment. The I<text> is ignored.
Note that Perl closes
the comment as soon as it sees a C<")">, so there is no way to put a literal
C<")"> in the comment. The pattern's closing delimiter must be escaped by
a backslash if it appears in the comment.
See L</E<sol>x> for another way to have comments in patterns.
Note that a comment can go just about anywhere, except in the middle of
an escape sequence. Examples:
qr/foo(?#comment)bar/ # Matches 'foobar'
# The pattern below matches 'abcd', 'abccd', or 'abcccd'
qr/abc(?#comment between literal and its quantifier){1,3}d/
# The pattern below generates a syntax error, because the '\p' must
# be followed immediately by a '{'.
qr/\p(?#comment between \p and its property name){Any}/
# The pattern below generates a syntax error, because the initial
# '\(' is a literal opening parenthesis, and so there is nothing
# for the closing ')' to match
qr/\(?#the backslash means this isn't a comment)p{Any}/
# Comments can be used to fold long patterns into multiple lines
qr/First part of a long regex(?#
)remaining part/
=item C<(?adlupimnsx-imnsx)>
=item C<(?^alupimnsx)>
X<(?)> X<(?^)>
Zero or more embedded pattern-match modifiers, to be turned on (or
turned off if preceded by C<"-">) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).
This is particularly useful for dynamically-generated patterns,
such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere. Consider the case where some patterns want to be
case-sensitive and some do not: The case-insensitive ones merely need to
include C<(?i)> at the front of the pattern. For example:
$pattern = "foobar";
if ( /$pattern/i ) { }
# more flexible:
$pattern = "(?i)foobar";
if ( /$pattern/ ) { }
These modifiers are restored at the end of the enclosing group. For example,
( (?i) blah ) \s+ \g1
will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.
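A minimal sketch of both points follows: patterns that carry their own C<(?i)>, and the scope of C<(?i)> ending with its enclosing group. The pattern list and test strings are assumptions for illustration.

```perl
use strict;
use warnings;

# Patterns as they might arrive from a configuration file: the first
# asks for case-insensitivity itself, the second does not.
my @patterns = ('(?i)foobar', 'foobar');
my @hits = map { ('FooBar' =~ /$_/) ? 1 : 0 } @patterns;

# (?i) applies only inside group 1; the backreference after the group
# must repeat the captured text exactly, including its case.
my $re = qr/ ( (?i) blah ) \s+ \g1 /x;
my $same_case  = ('Blah Blah' =~ $re) ? 1 : 0;
my $other_case = ('Blah blah' =~ $re) ? 1 : 0;

print "@hits $same_case $other_case\n";
```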
These modifiers do not carry over into named subpatterns called in the
enclosing group. In other words, a pattern such as C<((?i)(?&I<NAME>))> does not
change the case-sensitivity of the I<NAME> pattern.
A modifier is overridden by later occurrences of this construct in the
same scope containing the same modifier, so that
/((?im)foo(?-m)bar)/
matches all of C<foobar> case insensitively, but uses C</m> rules for
only the C<foo> portion. The C<"a"> flag overrides C<aa> as well;
likewise C<aa> overrides C<"a">. The same goes for C<"x"> and C<xx>.
Hence, in
/(?-x)foo/xx
both C</x> and C</xx> are turned off during matching C<foo>. And in
/(?x)foo/x
C</x> but NOT C</xx> is turned on for matching C<foo>. (One might
mistakenly think that since the inner C<(?x)> is already in the scope of
C</x>, that the result would effectively be the sum of them, yielding
C</xx>. It doesn't work that way.) Similarly, doing something like
C<(?xx-x)foo> turns off all C<"x"> behavior for matching C<foo>, it is not
that you subtract 1 C<"x"> from 2 to get 1 C<"x"> remaining.
Any of these modifiers can be set to apply globally to all regular
expressions compiled within the scope of a C<use re>. See
L<re/"'/flags' mode">.
Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imnsx>. Flags (except
C<"d">) may follow the caret to override it.
But a minus sign is not legal with it.
Note that the C<"a">, C<"d">, C<"l">, C<"p">, and C<"u"> modifiers are special in
that they can only be enabled, not disabled, and the C<"a">, C<"d">, C<"l">, and
C<"u"> modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one (or two C<"a">'s) may appear in the
construct. Thus, for
example, C<(?-p)> will warn when compiled under C<use warnings>;
C<(?-d:...)> and C<(?dl:...)> are fatal errors.
Note also that the C<"p"> modifier is special in that its presence
anywhere in a pattern has a global effect.
Having zero modifiers makes this a no-op (so why did you specify it,
unless it's generated code), and starting in v5.30, warns under L<C<use
re 'strict'>|re/'strict' mode>.
=item C<(?:I<pattern>)>
X<(?:)>
=item C<(?adluimnsx-imnsx:I<pattern>)>
In Perl 5.35.10 the scope of the experimental nature of this construct
has been reduced, and experimental warnings will only be produced when
the construct contains capturing parentheses. The warnings will be
raised at pattern compilation time, unless turned off, in the
C<experimental::vlb> category. This is to warn you that the exact
contents of capturing buffers in a variable length negative lookbehind
is not well defined and is subject to change in a future release of perl.
Currently if you use capture buffers inside of a negative variable length
lookbehind the result may not be what you expect, for instance:
say "axfoo"=~/(?=foo)(?<!(a|ax)(?{ say $1 }))/ ? "y" : "n";
will output the following:
a
n
which does not make sense as this should print out "ax" as the "a" does
not line up at the correct place. Another example would be:
say "yes: '$1-$2'" if "aayfoo"=~/(?=foo)(?<!(a|aa)(a|aa)x)/;
will output the following:
yes: 'aa-a'
It is possible that in a future release of perl we will change this behavior
so both of these examples produce more reasonable output.
Note that we are confident that the construct will match and reject
patterns appropriately, the undefined behavior strictly relates to the
value of the capture buffer during or after matching.
There is a technique that can be used to handle variable length
lookbehind on earlier releases, and lookbehinds longer than 255
characters. It is described in
L<http://www.drregex.com/2019/02/variable-length-lookbehinds-actually.html>.
Note that under C</i>, a few single characters match two or three other
characters. This makes them variable length, and the 255-character limit
applies to the maximum number of characters in the match. For
example C<qr/\N{LATIN SMALL LETTER SHARP S}/i> matches the sequence
C<"ss">. Your lookbehind assertion could contain 127 Sharp S
characters under C</i>, but adding a 128th would generate a compilation
error, as that could match 256 C<"s"> characters in a row.
Use of the non-greedy modifier C<"?"> may not give you the expected
results if it is within a capturing group within the construct.
=back
=item C<< (?<I<NAME>>I<pattern>) >>
=item C<(?'I<NAME>'I<pattern>)>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
A named capture group. Identical in every respect to normal capturing
parentheses C<()> but for the additional fact that the group
can be referred to by name in various regular expression
constructs (like C<\g{I<NAME>}>) and can be accessed by name
after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.
If multiple distinct capture groups have the same name, then
C<$+{I<NAME>}> will refer to the leftmost defined group in the match.
The forms C<(?'I<NAME>'I<pattern>)> and C<< (?<I<NAME>>I<pattern>) >>
are equivalent.
B<NOTE:> While the notation of this construct is the same as the similar
function in .NET regexes, the behavior is not. In Perl the groups are
numbered sequentially regardless of being named or not. Thus in the
pattern
/(x)(?<foo>y)(z)/
C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.
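The numbering rule and the handling of duplicate names can both be checked with a short sketch. The sample strings and variable names are illustrative assumptions.

```perl
use strict;
use warnings;

# Named groups take part in the ordinary left-to-right numbering.
my ($by_name, $second, $third);
if ('xyz' =~ /(x)(?<foo>y)(z)/) {
    ($by_name, $second, $third) = ($+{foo}, $2, $3);
}

# With a duplicated name, %+ yields the leftmost defined group,
# while %- holds an array reference with every group of that name.
my ($leftmost, @all);
if ('ab' =~ /(?<c>a)(?<c>b)/) {
    $leftmost = $+{c};
    @all      = @{ $-{c} };
}

print "$by_name $second $third $leftmost @all\n";
```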
Currently I<NAME> is restricted to simple identifiers only.
In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
its Unicode extension (see L<utf8>),
though it isn't extended by the locale (see L<perllocale>).
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<<
(?PE<lt>I<NAME>E<gt>I<pattern>) >>
may be used instead of C<< (?<I<NAME>>I<pattern>) >>; however this form does not
support the use of single quotes as a delimiter for the name.
=item C<< \k<I<NAME>> >>
=item C<< \k'I<NAME>' >>
=item C<< \k{I<NAME>} >>
Named backreference. Similar to numeric backreferences, except that
the group is designated by name and not number. If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.
It is an error to refer to a name not defined by a C<< (?<I<NAME>>) >>
earlier in the pattern.
All three forms are equivalent, although with C<< \k{ I<NAME> } >>,
you may optionally have blanks within but adjacent to the braces, as
shown.
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=I<NAME>) >>
may be used instead of C<< \k<I<NAME>> >>.
=item C<(?{ I<code> })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
B<WARNING>: Using this feature safely requires that you understand its
limitations. Code executed that has side effects may not perform identically
from version to version due to the effect of future optimisations in the regex
engine. For more information on this, see L</Embedded Code Execution
Frequency>.
Also, it's worth noting that patterns defined this way probably will
not be as efficient, as the optimizer is not very clever about
handling them.
An example of how this might be used is as follows:
/(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
(?(DEFINE)
(?<NAME_PAT>....)
(?<ADDRESS_PAT>....)
)/x
Note that capture groups matched inside of recursion are not accessible
after the recursion returns, so the extra layer of capturing groups is
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.
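Here is a runnable sketch of the same layout with the elided subpatterns filled in. The definitions of C<NAME_PAT> and C<ADDRESS_PAT> below are deliberately simplistic assumptions for illustration, not real name or address grammars.

```perl
use strict;
use warnings;

my $entry = qr/
    \A (?<NAME>(?&NAME_PAT)) \s* < (?<ADDR>(?&ADDRESS_PAT)) > \z
    (?(DEFINE)
        (?<NAME_PAT>    [A-Za-z]+ (?: \s [A-Za-z]+ )* )    # words (assumed)
        (?<ADDRESS_PAT> [^<>\s\@]+ \@ [^<>\s\@]+ )         # x@y   (assumed)
    )
/x;

my ($name, $addr, $helper_defined);
if ('Jane Doe <jane@example.com>' =~ $entry) {
    $name           = $+{NAME};
    $addr           = $+{ADDR};
    # Captures made inside the recursion are not kept after it returns.
    $helper_defined = defined($+{NAME_PAT}) ? 1 : 0;
}

print "$name | $addr | $helper_defined\n";
```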
Finally, keep in mind that subpatterns created inside a DEFINE block
count towards the absolute and relative number of captures, so this:
my @captures = "a" =~ /(.) # First capture
(?(DEFINE)
(?<EXAMPLE> 1 ) # Second capture
)/x;
say scalar @captures;
will output 2, not 1. This is particularly important if you intend to
compile the definitions with the C<qr//> operator, and later
interpolate them in another pattern.
=item C<< (?>I<pattern>) >>
=item C<< (*atomic:I<pattern>) >>
X<(?E<gt>pattern)>
X<(*atomic>
X<backtrack> X<backtracking> X<atomic> X<possessive>
An "independent" subexpression, one which matches the substring
that a standalone I<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>. This
construct is useful for optimizations of what would otherwise be
"eternal" matches, because it will not backtrack (see L</"Backtracking">).
It may also be useful in places where the "grab all you can, and do not
give anything back" semantic is desirable.
For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
characters C<"a"> at the beginning of string, leaving no C<"a"> for
C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
since the match of the subgroup C<a*> is influenced by the following
group C<ab> (see L</"Backtracking">). In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.
C<< (?>I<pattern>) >> does not disable backtracking altogether once it has
matched. It is still possible to backtrack past the construct, but not
into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".
An effect similar to C<< (?>I<pattern>) >> may be achieved by writing
C<(?=(I<pattern>))\g{-1}>. This matches the same substring as a standalone
I<pattern>, and the following C<\g{-1}> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
in the rest of a regular expression.)
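The behaviours described in this entry can be confirmed with a few one-line matches. This is only a sketch; the test strings are the obvious ones suggested by the prose.

```perl
use strict;
use warnings;

# The atomic group takes every "a" and never gives one back.
my $atomic = ('aaab' =~ /^(?>a*)ab/) ? 1 : 0;

# The ordinary greedy form backtracks so that "ab" can match.
my $greedy = ('aaab' =~ /^a*ab/) ? 1 : 0;

# Backtracking past an atomic group, into the other branch, still works.
my $past = ('bar' =~ /((?>a*)|(?>b*))ar/) ? 1 : 0;

# The lookahead-plus-backreference emulation behaves like the atomic group.
my $emulated = ('aaab' =~ /^(?=(a*))\g{-1}ab/) ? 1 : 0;

print "$atomic $greedy $past $emulated\n";
```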
Consider this pattern:
m{ \(
(
[^()]+ # x+
|
\( [^()]* \)
)+
\)
}x
That will efficiently match a nonempty group with matching parentheses
two levels deep or less. However, if there is no such group, it
will take virtually forever on a long string. That's because there
are so many different ways to split a long string into several
substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
to a subpattern of the above pattern. Consider how the pattern
above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
seconds, but that each extra letter doubles this time. This
exponential performance will make it appear that your program has
hung. However, a tiny change to this pattern
m{ \(
(
(?> [^()]+ ) # change x+ above to (?> x+ )
|
\( [^()]* \)
)+
\)
}x
which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
of the time when used on a similar string with 1000000 C<"a">s. Be aware,
however, that, when this construct is followed by a
quantifier, it currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.
On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<"a">s.
The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution. Suppose we parse text with comments being delimited
by C<"#"> followed by some optional (horizontal) whitespace. Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way. The correct
answer is either one of these:
(?>#[ \t]*)
#[ \t]*(?![ \t])
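A minimal sketch of the difference, using an assumed input with two spaces after the C<#> and a deliberately awkward tail that itself begins with whitespace:

```perl
use strict;
use warnings;

my $text = "#  note";    # "#", two spaces, then the word

# Plain greedy form: gives one space back so the tail can match,
# and the "delimiter" ends up shorter than intended.
my ($plain) = $text =~ /(#[ \t]*)[ \t]note/;

# Atomic form: keeps all the whitespace, so this tail cannot match.
my $atomic = ($text =~ /(?>#[ \t]*)[ \t]note/) ? 1 : 0;

# Negative lookahead form: same result as the atomic one.
my $lookahead = ($text =~ /#[ \t]*(?![ \t])[ \t]note/) ? 1 : 0;

print length($plain), " $atomic $lookahead\n";
```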
$re = customre::convert $re;
/\Y|$re\Y|/;
=head2 Embedded Code Execution Frequency
The exact rules for how often C<(?{})> and C<(??{})> are executed in a pattern
are unspecified, and this is even more true of C<(*{})>.
In the case of a successful match you can assume that they DWIM and
will be executed in left to right order the appropriate number of times in the
accepting path of the pattern, as would any other meta-pattern. How
non-accepting pathways and match failures affect the number of times a pattern is
executed is deliberately unspecified; it may vary depending on what
optimizations can be applied to the pattern, and is likely to change from
version to version.
For instance in
"aaabcdeeeee"=~/a(?{print "a"})b(?{print "b"})cde/;
the exact number of times "a" or "b" are printed out is unspecified for
failure, but you may assume they will be printed at least once during
a successful match. Additionally, you may assume that if "b" is printed,
it will be preceded by at least one "a".
In the case of branching constructs like the following:
/a(b|(?{ print "a" }))c(?{ print "c" })/;
you can assume that the input "ac" will output "ac", and that "abc"
will output only "c".
When embedded code is quantified, successful matches will call the
code once for each matched iteration of the quantifier. For
example:
"good" =~ /g(?:o(?{print "o"}))*d/;
will output "o" twice.
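The guarantees for successful matches can be observed without printing, by recording the calls in variables. This is a sketch; the package variables C<$count> and C<$out> are names introduced here for illustration.

```perl
use strict;
use warnings;

# Quantified code block: one call per matched iteration of (?:o...)*.
our $count = 0;
'good' =~ /g(?:o(?{ $count++ }))*d/;

# Branching: the code in the second alternative runs only when that
# alternative is part of the accepting path.
our $out = '';
'ac' =~ /a(b|(?{ $out .= 'a' }))c(?{ $out .= 'c' })/;
my $out_ac = $out;

$out = '';
'abc' =~ /a(b|(?{ $out .= 'a' }))c(?{ $out .= 'c' })/;
my $out_abc = $out;

print "$count $out_ac $out_abc\n";
```

All three matches succeed, so only the documented guarantees are relied upon; nothing here depends on how often code runs during a failed match.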
For historical and consistency reasons the use of normal code blocks
anywhere in a pattern will disable certain optimisations. As of 5.37.7
you can use an "optimistic" codeblock, C<(*{ ... })>, as a replacement
for C<(?{ ... })>, if you do I<not> wish to disable these optimisations.
This may result in the code block being called less often than it would
have been had it not been optimistic.
=head2 PCRE/Python Support
As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
Perl-specific syntax, the following are also accepted:
=over 4
=item C<< (?PE<lt>I<NAME>E<gt>I<pattern>) >>
Define a named capture group. Equivalent to C<< (?<I<NAME>>I<pattern>) >>.
=item C<< (?P=I<NAME>) >>
Backreference to a named capture group. Equivalent to C<< \g{I<NAME>} >>.
=item C<< (?P>I<NAME>) >>
Subroutine call to a named capture group. Equivalent to C<< (?&I<NAME>) >>.
=back
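A brief sketch using all three Python/PCRE-style forms follows; the group names and sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

# (?P<word>...) defines the group, (?P=word) refers back to it.
my $repeated;
if ('say the the word' =~ /\b(?P<word>\w+)\s+(?P=word)\b/) {
    $repeated = $+{word};
}

# (?P>pair) calls the group as a subroutine, matching "ab" a second time.
my $recursed = ('abab' =~ /\A(?P<pair>ab)(?P>pair)\z/) ? 1 : 0;

print "$repeated $recursed\n";
```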
=head1 BUGS
There are a number of issues with regard to case-insensitive matching
in Unicode rules. See C<"i"> under L</Modifiers> above.
This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.
This document needs a rewrite that separates the tutorial content
from the reference content.
=head1 SEE ALSO
The syntax of patterns used in Perl pattern matching evolved from those
supplied in the Bell Labs Research Unix 8th Edition (Version 8) regex
routines. (The code is actually derived (distantly) from Henry
Spencer's freely redistributable reimplementation of those V8 routines.)
L<perlrequick>.
L<perlretut>.
L<perlop/"Regexp Quote-Like Operators">.
L<perlop/"Gory details of parsing quoted constructs">.
L<perlfaq6>.
L<perlfunc/pos>.
L<perllocale>.
L<perlebcdic>.
I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.