white space and some spelling

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@91117 c90b9560-bf6c-de11-be94-00142212c4b1
Hakan Kuecuekyilmaz 2002-08-06 20:04:34 +00:00
parent 505aacbeef
commit 931ba4788e


@ -1,5 +1,5 @@
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- $Revision: 1.4 $ -->
<!-- $Revision: 1.5 $ -->
<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
<refentry id="pcre.pattern.syntax">
<refnamediv>
@ -25,8 +25,8 @@
<listitem>
<simpara>
By default, a whitespace character is any character that
the C library function isspace() recognizes, though it is
possible to compile PCRE with alternative character type
the C library function isspace() recognizes, though it is
possible to compile PCRE with alternative character type
tables. Normally isspace() matches space, formfeed, newline,
carriage return, horizontal tab, and vertical tab. Perl 5 no
longer includes vertical tab in its set of whitespace characters.
@ -38,19 +38,19 @@
</listitem>
<listitem>
<simpara>
PCRE does not allow repeat quantifiers on lookahead
PCRE does not allow repeat quantifiers on lookahead
assertions. Perl permits them, but they do not mean what you
might think. For example, (?!a){3} does not assert that the
next three characters are not "a". It just asserts that the
next three characters are not "a". It just asserts that the
next character is not "a" three times.
</simpara>
</listitem>
<listitem>
<simpara>
Capturing subpatterns that occur inside negative looka-
head assertions are counted, but their entries in the
offsets vector are never set. Perl sets its numerical vari-
ables from any such patterns that are matched before the
Capturing subpatterns that occur inside negative looka-
head assertions are counted, but their entries in the
offsets vector are never set. Perl sets its numerical vari-
ables from any such patterns that are matched before the
assertion fails to match something (thereby succeeding), but
only if the negative lookahead assertion contains just one
branch.
@ -59,8 +59,8 @@
<listitem>
<simpara>
Though binary zero characters are supported in the subject string,
they are not allowed in a pattern string because it is passed as a
normal C string, terminated by zero. The escape sequence "\\x00" can
they are not allowed in a pattern string because it is passed as a
normal C string, terminated by zero. The escape sequence "\\x00" can
be used in the pattern to represent a binary zero.
</simpara>
</listitem>
@ -80,7 +80,7 @@
</listitem>
<listitem>
<simpara>
Fairly obviously, PCRE does not support the (?{code})
Fairly obviously, PCRE does not support the (?{code})
construction.
</simpara>
</listitem>
@ -181,7 +181,7 @@
<para>
There are two different sets of meta-characters: those that
are recognized anywhere in the pattern except within square
brackets, and those that are recognized in square brackets.
brackets, and those that are recognized in square brackets.
Outside square brackets, the meta-characters are as follows:
<variablelist>
<varlistentry>
@ -196,7 +196,7 @@
<term><emphasis>^</emphasis></term>
<listitem>
<simpara>
assert start of subject (or line, in multiline mode)
assert start of subject (or line, in multiline mode)
</simpara>
</listitem>
</varlistentry>
@ -298,8 +298,8 @@
</varlistentry>
</variablelist>
Part of a pattern that is in square brackets is called a
"character class". In a character class the only meta-
Part of a pattern that is in square brackets is called a
"character class". In a character class the only meta-
characters are:
<variablelist>
<varlistentry>
@ -335,7 +335,7 @@
</listitem>
</varlistentry>
</variablelist>
The following sections describe the use of each of the
The following sections describe the use of each of the
meta-characters.
</para>
</refsect2>
@ -343,16 +343,16 @@
<title>backslash</title>
<para>
The backslash character has several uses. Firstly, if it is
followed by a non-alphameric character, it takes away any
special meaning that character may have. This use of
backslash as an escape character applies both inside and
followed by a non-alphanumeric character, it takes away any
special meaning that character may have. This use of
backslash as an escape character applies both inside and
outside character classes.
</para>
<para>
For example, if you want to match a "*" character, you write
"\*" in the pattern. This applies whether or not the follow-
ing character would otherwise be interpreted as a meta-
character, so it is always safe to precede a non-alphameric
ing character would otherwise be interpreted as a meta-
character, so it is always safe to precede a non-alphanumeric
with "\" to specify that it stands for itself. In particu-
lar, if you want to match a backslash, you write "\\".
</para>
@ -365,11 +365,11 @@
of the pattern.
</para>
<para>
A second use of backslash provides a way of encoding non-
printing characters in patterns in a visible manner. There
is no restriction on the appearance of non-printing charac-
ters, apart from the binary zero that terminates a pattern,
but when a pattern is being prepared by text editing, it is
A second use of backslash provides a way of encoding non-
printing characters in patterns in a visible manner. There
is no restriction on the appearance of non-printing characters,
apart from the binary zero that terminates a pattern,
but when a pattern is being prepared by text editing, it is
usually easier to use one of the following escape sequences
than the binary character it represents:
</para>
@ -450,38 +450,41 @@
</variablelist>
</para>
<para>
The precise effect of "<literal>\cx</literal>" is as follows: if "<literal>x</literal>" is a lower
case letter, it is converted to upper case. Then bit 6 of
the character (hex 40) is inverted. Thus "<literal>\cz</literal>" becomes hex
1A, but "<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>" becomes hex 7B.
The precise effect of "<literal>\cx</literal>" is as follows:
if "<literal>x</literal>" is a lower case letter, it is converted
to upper case. Then bit 6 of the character (hex 40) is inverted.
Thus "<literal>\cz</literal>" becomes hex 1A, but
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
becomes hex 7B.
</para>
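The bit arithmetic in this paragraph can be checked with a short sketch. It is written in Python purely for illustration (the manual itself documents PHP), and control_char is a hypothetical helper name, not part of PCRE or PHP.

```python
def control_char(x: str) -> int:
    # Hypothetical helper: returns the byte value that a backslash-c
    # escape followed by the character x stands for.
    code = ord(x)
    # A lower case letter is converted to upper case first.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Then bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


for ch in ("z", "{", ";"):
    print(f"\\c{ch} -> hex {control_char(ch):02X}")
```

Running it should reproduce the three values quoted in the paragraph: hex 1A, 3B and 7B.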
<para>
After "<literal>\x</literal>", up to two hexadecimal digits are read (letters
can be in upper or lower case).
After "<literal>\x</literal>", up to two hexadecimal digits are
read (letters can be in upper or lower case).
</para>
<para>
After "<literal>\0</literal>" up to two further octal digits are read. In both
cases, if there are fewer than two digits, just those that
are present are used. Thus the sequence "<literal>\0\x\07</literal>" specifies
two binary zeros followed by a BEL character. Make sure you
supply two digits after the initial zero if the character
After "<literal>\0</literal>" up to two further octal digits are read.
In both cases, if there are fewer than two digits, just those that
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
specifies two binary zeros followed by a BEL character. Make sure you
supply two digits after the initial zero if the character
that follows is itself an octal digit.
</para>
<para>
The handling of a backslash followed by a digit other than 0
is complicated. Outside a character class, PCRE reads it
is complicated. Outside a character class, PCRE reads it
and any following digits as a decimal number. If the number
is less than 10, or if there have been at least that many
previous capturing left parentheses in the expression, the
entire sequence is taken as a <emphasis>back</emphasis> <emphasis>reference</emphasis>. A description
entire sequence is taken as a <emphasis>back</emphasis>
<emphasis>reference</emphasis>. A description
of how this works is given later, following the discussion
of parenthesized subpatterns.
</para>
<para>
Inside a character class, or if the decimal number is
greater than 9 and there have not been that many capturing
subpatterns, PCRE re-reads up to three octal digits follow-
ing the backslash, and generates a single byte from the
greater than 9 and there have not been that many capturing
subpatterns, PCRE re-reads up to three octal digits following
the backslash, and generates a single byte from the
least significant 8 bits of the value. Any subsequent digits
stand for themselves. For example:
</para>
@ -566,15 +569,15 @@
</variablelist>
</para>
<para>
Note that octal values of 100 or greater must not be intro-
duced by a leading zero, because no more than three octal
Note that octal values of 100 or greater must not be intro-
duced by a leading zero, because no more than three octal
digits are ever read.
</para>
<para>
All the sequences that define a single byte value can be
All the sequences that define a single byte value can be
used both inside and outside character classes. In addition,
inside a character class, the sequence "<literal>\b</literal>" is interpreted
as the backspace character (hex 08). Outside a character
inside a character class, the sequence "<literal>\b</literal>"
is interpreted as the backspace character (hex 08). Outside a character
class it has a different meaning (see below).
</para>
<para>
@ -635,32 +638,32 @@
</para>
<para>
Each pair of escape sequences partitions the complete set of
characters into two disjoint sets. Any given character
characters into two disjoint sets. Any given character
matches one, and only one, of each pair.
</para>
<para>
A "word" character is any letter or digit or the underscore
A "word" character is any letter or digit or the underscore
character, that is, any character which can be part of a
Perl "<literal>word</literal>". The definition of letters and digits is
controlled by PCRE's character tables, and may vary if locale-specific
matching is taking place (see "Locale support"
controlled by PCRE's character tables, and may vary if locale-specific
matching is taking place (see "Locale support"
above). For example, in the "fr" (French) locale, some char-
acter codes greater than 128 are used for accented letters,
acter codes greater than 128 are used for accented letters,
and these are matched by <literal>\w</literal>.
</para>
<para>
These character type sequences can appear both inside and
These character type sequences can appear both inside and
outside character classes. They each match one character of
the appropriate type. If the current matching point is at
the appropriate type. If the current matching point is at
the end of the subject string, all of them fail, since there
is no character to match.
</para>
<para>
The fourth use of backslash is for certain simple asser-
tions. An assertion specifies a condition that has to be met
at a particular point in a match, without consuming any
characters from the subject string. The use of subpatterns
for more complicated assertions is described below. The
at a particular point in a match, without consuming any
characters from the subject string. The use of subpatterns
for more complicated assertions is described below. The
backslashed assertions are
</para>
<para>
@ -693,7 +696,7 @@
<term><emphasis>\Z</emphasis></term>
<listitem>
<simpara>
end of subject or newline at end (independent of
end of subject or newline at end (independent of
multiline mode)
</simpara>
</listitem>
@ -702,7 +705,7 @@
<term><emphasis>\z</emphasis></term>
<listitem>
<simpara>
end of subject (independent of multiline mode)
end of subject (independent of multiline mode)
</simpara>
</listitem>
</varlistentry>
@ -714,20 +717,23 @@
character, inside a character class).
</para>
<para>
A word boundary is a position in the subject string where
A word boundary is a position in the subject string where
the current character and the previous character do not both
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
<literal>\w</literal> and the other matches
<literal>\W</literal>), or the start or end of the string if the first or last
character matches \w, respectively.
<literal>\W</literal>), or the start or end of the string if the first
or last character matches \w, respectively.
</para>
<para>
The <literal>\A</literal>, <literal>\Z</literal>, and <literal>\z</literal> assertions differ from the traditional
The <literal>\A</literal>, <literal>\Z</literal>, and
<literal>\z</literal> assertions differ from the traditional
circumflex and dollar (described below) in that they only
ever match at the very start and end of the subject string,
whatever options are set. They are not affected by the
<link linkend="pcre.pattern.modifiers">PCRE_NOTBOL</link> or <link linkend="pcre.pattern.modifiers">PCRE_NOTEOL</link> options. The difference between
<literal>\Z</literal> and <literal>\z</literal> is that <literal>\Z</literal>
<link linkend="pcre.pattern.modifiers">PCRE_NOTBOL</link> or
<link linkend="pcre.pattern.modifiers">PCRE_NOTEOL</link> options.
The difference between <literal>\Z</literal> and
<literal>\z</literal> is that <literal>\Z</literal>
matches before a newline that is the
last character of the string as well as at the end of the
string, whereas <literal>\z</literal> matches only at the end.
@ -744,7 +750,7 @@
different meaning (see below).
Circumflex need not be the first character of the pattern if
a number of alternatives are involved, but it should be the
a number of alternatives are involved, but it should be the
first thing in each alternative in which it appears if the
pattern is ever to match that branch. If all possible alter-
natives start with a circumflex, that is, if the pattern is
@ -763,7 +769,8 @@
The meaning of dollar can be changed so that it matches only
at the very end of the string, by setting the
<link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> option at compile or matching time. This
<link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
option at compile or matching time. This
does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are
@ -873,7 +880,7 @@
For example, the class [^\W_] matches any letter or digit,
but not underscore.
All non-alphameric characters other than \, -, ^ (at the
All non-alphanumeric characters other than \, -, ^ (at the
start) and the terminating ] are non-special in character
classes, but it does no harm if they are escaped.
</literallayout>
@ -887,8 +894,8 @@
gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alter-
natives may appear, and an empty alternative is permitted
matches either "gilbert" or "sullivan". Any number of alternatives
may appear, and an empty alternative is permitted
(matching the empty string). The matching process tries
each alternative in turn, from left to right, and the first
one that succeeds is used. If the alternatives are within a
@ -933,11 +940,11 @@
abc(?i)
which in turn is the same as compiling the pattern abc with
<link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> set. In other words, such "top level" set-
tings apply to the whole pattern (unless there are other
changes inside subpatterns). If there is more than one set-
ting of the same option at top level, the rightmost setting
is used.
<link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> set.
In other words, such "top level" settings apply to the whole
pattern (unless there are other changes inside subpatterns).
If there is more than one setting of the same option at top level,
the rightmost setting is used.
If an option change occurs inside a subpattern, the effect
is different. This is a change of behaviour in Perl 5.005.
@ -958,8 +965,7 @@
matches "ab", "aB", "c", and "C", even though when matching
"C" the first branch is abandoned before the option setting.
This is because the effects of option settings happen at
compile time. There would be some very weird behaviour oth-
erwise.
compile time. There would be some very weird behaviour otherwise.
The PCRE-specific options <link linkend="pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
<link linkend="pcre.pattern.modifiers">PCRE_EXTRA</link> can
@ -975,25 +981,26 @@
<title>subpatterns</title>
<literallayout>
Subpatterns are delimited by parentheses (round brackets),
which can be nested. Marking part of a pattern as a subpat-
tern does two things:
which can be nested. Marking part of a pattern as a subpattern
does two things:
1. It localizes a set of alternatives. For example, the pat-
tern
cat(aract|erpillar|)
matches one of the words "cat", "cataract", or "caterpil-
lar". Without the parentheses, it would match "cataract",
matches one of the words "cat", "cataract", or "caterpillar".
Without the parentheses, it would match "cataract",
"erpillar" or the empty string.
2. It sets up the subpattern as a capturing subpattern (as
defined above). When the whole pattern matches, that por-
tion of the subject string that matched the subpattern is
passed back to the caller via the <emphasis>ovector</emphasis> argument of
<function>pcre_exec</function>. Opening parentheses are counted from left to
right (starting from 1) to obtain the numbers of the captur-
ing subpatterns.
defined above). When the whole pattern matches, that portion
of the subject string that matched the subpattern is
passed back to the caller via the <emphasis>ovector</emphasis>
argument of
<function>pcre_exec</function>. Opening parentheses are counted
from left to right (starting from 1) to obtain the numbers of the
capturing subpatterns.
For example, if the string "the red king" is matched against
the pattern
@ -1004,8 +1011,8 @@
and are numbered 1, 2, and 3.
The fact that plain parentheses fulfil two functions is not
always helpful. There are often times when a grouping sub-
pattern is required without a capturing requirement. If an
always helpful. There are often times when a grouping subpattern
is required without a capturing requirement. If an
opening parenthesis is followed by "?:", the subpattern does
not do any capturing, and is not counted when computing the
number of any subsequent capturing subpatterns. For example,
@ -1015,8 +1022,8 @@
the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and
are numbered 1 and 2. The maximum number of captured sub-
strings is 99, and the maximum number of all subpatterns,
are numbered 1 and 2. The maximum number of captured substrings
is 99, and the maximum number of all subpatterns,
both capturing and non-capturing, is 200.
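The difference between capturing and non-capturing parentheses can be sketched as follows. The example uses Python's re module only as a stand-in, on the assumption that it numbers groups the same way as PCRE for these two patterns; the variable names are illustrative.

```python
import re

# Plain parentheses both group and capture.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?:...) groups without capturing, so it is skipped when numbering.
grouping = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

print(capturing.groups())  # captures numbered 1, 2 and 3
print(grouping.groups())   # captures numbered 1 and 2
```

With the "?:" form, "queen" moves from capture 3 to capture 2 because the colour alternative no longer counts.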
As a convenient shorthand, if any option settings are
@ -1072,8 +1079,8 @@
matches exactly 8 digits. An opening curly bracket that
appears in a position where a quantifier is not allowed, or
one that does not match the syntax of a quantifier, is taken
as a literal character. For example, {,6} is not a quantif-
ier, but a literal string of four characters.
as a literal character. For example, {,6} is not a quantifier,
but a literal string of four characters.
The quantifier {0} is permitted, causing the expression to
behave as if the previous item and the quantifier were not
@ -1099,13 +1106,13 @@
fact match no characters, the loop is forcibly broken.
By default, the quantifiers are "greedy", that is, they
match as much as possible (up to the maximum number of per-
mitted times), without causing the rest of the pattern to
match as much as possible (up to the maximum number of permitted
times), without causing the rest of the pattern to
fail. The classic example of where this gives problems is in
trying to match comments in C programs. These appear between
the sequences /* and */ and within the sequence, individual
* and / characters may appear. An attempt to match C com-
ments by applying the pattern
* and / characters may appear. An attempt to match C comments
by applying the pattern
/\*.*\*/
@ -1123,8 +1130,8 @@
/\*.*?\*/
does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the pre-
ferred number of matches. Do not confuse this use of ques-
various quantifiers is not otherwise changed, just the preferred
number of matches. Do not confuse this use of ques-
tion mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as
in
@ -1141,33 +1148,32 @@
default behaviour.
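The greedy and non-greedy forms of the C comment pattern can be compared with a small sketch. It uses Python's re module as a stand-in for PCRE (an assumption; the two agree on these patterns), and the subject string is chosen only for illustration.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Non-greedy: .*? stops at the first */ that lets the pattern match.
lazy = re.search(r"/\*.*?\*/", subject).group()

print(greedy)
print(lazy)
```

The greedy pattern returns the whole subject, while the non-greedy one returns only the first comment.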
When a parenthesized subpattern is quantified with a minimum
repeat count that is greater than 1 or with a limited max-
imum, more store is required for the compiled pattern, in
repeat count that is greater than 1 or with a limited maximum,
more store is required for the compiled pattern, in
proportion to the size of the minimum or maximum.
If a pattern starts with .* or .{0,} and the <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
option (equivalent to Perl's /s) is set, thus allowing the .
to match newlines, then the pattern is implicitly anchored,
because whatever follows will be tried against every charac-
ter position in the subject string, so there is no point in
because whatever follows will be tried against every character
position in the subject string, so there is no point in
retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by \A.
In cases where it is known that the subject string contains
no newlines, it is worth setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> when the pat-
tern begins with .* in order to obtain this optimization, or
no newlines, it is worth setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> when the pattern begins with .* in order to
obtain this optimization, or
alternatively using ^ to indicate anchoring explicitly.
When a capturing subpattern is repeated, the value captured
is the substring that matched the final iteration. For exam-
ple, after
is the substring that matched the final iteration. For example, after
(tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the cap-
tured substring is "tweedledee". However, if there are
has matched "tweedledum tweedledee" the value of the captured
substring is "tweedledee". However, if there are
nested capturing subpatterns, the corresponding captured
values may have been set in previous iterations. For exam-
ple, after
values may have been set in previous iterations. For example,
after
/(a|(b))+/
@ -1191,28 +1197,27 @@
left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the left of
the reference for numbers less than 10. See the section
entitled "Backslash" above for further details of the han-
dling of digits following a backslash.
entitled "Backslash" above for further details of the handling
of digits following a backslash.
A back reference matches whatever actually matched the cap-
turing subpattern in the current subject string, rather than
A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than
anything matching the subpattern itself. So the pattern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and responsi-
bility", but not "sense and responsibility". If caseful
matches "sense and sensibility" and "response and responsibility",
but not "sense and responsibility". If caseful
matching is in force at the time of the back reference, then
the case of letters is relevant. For example,
((?i)rah)\s+\1
matches "rah rah" and "RAH RAH", but not "RAH rah", even
though the original capturing subpattern is matched case-
lessly.
though the original capturing subpattern is matched caselessly.
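A back reference repeats the text that was actually captured, not the subpattern. The sketch below checks the first example above using Python's re module as a stand-in for PCRE (an assumption; numbered back references behave the same way in both for this pattern).

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)

# \1 must repeat exactly what group 1 matched in this subject.
results = {s: pattern.fullmatch(s) is not None for s in subjects}

for subject, matched in results.items():
    print(f"{subject!r}: {'match' if matched else 'no match'}")
```

Only the first two subjects match, because in the third the text captured by group 1 ("sens") differs from what follows "and".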
There may be more than one back reference to the same sub-
pattern. If a subpattern has not actually been used in a
There may be more than one back reference to the same subpattern.
If a subpattern has not actually been used in a
particular match, then any back references to it always
fail. For example, the pattern
@ -1229,15 +1234,14 @@
A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, (a\1) never matches. However, such references can
be useful inside repeated subpatterns. For example, the pat-
tern
be useful inside repeated subpatterns. For example, the pattern
(a|b\1)+
matches any number of "a"s and also "aba", "ababaa" etc. At
each iteration of the subpattern, the back reference matches
the character string corresponding to the previous itera-
tion. In order for this to work, the pattern must be such
the character string corresponding to the previous iteration.
In order for this to work, the pattern must be such
that the first iteration does not need to match the back
reference. This can be done using alternation, as in the
example above, or by a quantifier with a minimum of zero.
@ -1250,8 +1254,8 @@
An assertion is a test on the characters following or
preceding the current matching point that does not actually
consume any characters. The simple assertions coded as \b,
\B, \A, \Z, \z, ^ and $ are described above. More compli-
cated assertions are coded as subpatterns. There are two
\B, \A, \Z, \z, ^ and $ are described above. More complicated
assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the
subject string, and those that look behind it.
@ -1278,8 +1282,8 @@
when the next three characters are "bar". A lookbehind
assertion is needed to achieve this effect.
Lookbehind assertions start with (?&lt;= for positive asser-
tions and (?&lt;! for negative assertions. For example,
Lookbehind assertions start with (?&lt;= for positive assertions
and (?&lt;! for negative assertions. For example,
(?&lt;!foo)bar
@ -1295,8 +1299,8 @@
(?&lt;!dogs?|cats?)
causes an error at compile time. Branches that match dif-
ferent length strings are permitted only at the top level of
causes an error at compile time. Branches that match different
length strings are permitted only at the top level of
a lookbehind assertion. This is an extension compared with
Perl 5.005, which requires all branches to match the same
length of string. An assertion such as
@ -1304,8 +1308,8 @@
(?&lt;=ab(c|de))
is not permitted, because its single top-level branch can
match two different lengths, but it is acceptable if rewrit-
ten to use two top-level branches:
match two different lengths, but it is acceptable if rewritten
to use two top-level branches:
(?&lt;=abc|abde)
@ -1314,8 +1318,8 @@
by the fixed width and then try to match. If there are
insufficient characters before the current position, the
match is deemed to fail. Lookbehinds in conjunction with
once-only subpatterns can be particularly useful for match-
ing at the ends of strings; an example is given at the end
once-only subpatterns can be particularly useful for matching
at the ends of strings; an example is given at the end
of the section on once-only subpatterns.
Several assertions (of any sort) may occur in succession.
@ -1398,23 +1402,22 @@
This kind of parenthesis "locks up" the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it. Back-
tracking past it to previous items, however, works as nor-
mal.
tracking past it to previous items, however, works as normal.
An alternative description is that a subpattern of this type
matches the string of characters that an identical stan-
dalone pattern would match, if anchored at the current point
matches the string of characters that an identical standalone
pattern would match, if anchored at the current point
in the subject string.
Once-only subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a max-
imizing repeat that must swallow everything it can. So,
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can. So,
while both \d+ and \d+? are prepared to adjust the number of
digits they match in order to make the rest of the pattern
match, (?&gt;\d+) can only match an entire sequence of digits.
This construction can of course contain arbitrarily compli-
cated subpatterns, and it can be nested.
This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.
Once-only subpatterns can be used in conjunction with look-
behind assertions to specify efficient matching at the end
@ -1442,19 +1445,18 @@
match only the entire string. The subsequent lookbehind
assertion does a single test on the last four characters. If
it fails, the match fails immediately. For long strings,
this approach makes a significant difference to the process-
ing time.
this approach makes a significant difference to the processing time.
When a pattern contains an unlimited repeat inside a subpat-
tern that can itself be repeated an unlimited number of
When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a once-only subpattern is the only way to
avoid some failing matches taking a very long time indeed.
The pattern
(\D+|&lt;\d+>)*[!?]
matches an unlimited number of substrings that either con-
sist of non-digits, or digits enclosed in &lt;>, followed by
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in &lt;>, followed by
either ! or ?. When it matches, it runs quickly. However, if
it is applied to
@ -1462,8 +1464,8 @@
it takes a long time before reporting failure. This is
because the string can be divided between the two repeats in
a large number of ways, and all have to be tried. (The exam-
ple used [!?] rather than a single character at the end,
a large number of ways, and all have to be tried. (The example
used [!?] rather than a single character at the end,
because both PCRE and Perl have an optimization that allows
for fast failure when a single character is used. They
remember the last single character that is required for a
@ -1472,16 +1474,15 @@
((?>\D+)|&lt;\d+>)*[!?]
sequences of non-digits cannot be broken, and failure hap-
pens quickly.
sequences of non-digits cannot be broken, and failure happens quickly.
</literallayout>
</refsect2>
<refsect2 id="regexp.reference.conditional">
<title>Conditional subpatterns</title>
<literallayout>
It is possible to cause the matching process to obey a sub-
pattern conditionally or to choose between two alternative
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or
whether a previous capturing subpattern matched or not. The
two possible forms of conditional subpattern are
@ -1489,16 +1490,16 @@
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
If the condition is satisfied, the yes-pattern is used; oth-
erwise the no-pattern (if present) is used. If there are
If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used. If there are
more than two alternatives in the subpattern, a compile-time
error occurs.
There are two kinds of condition. If the text between the
parentheses consists of a sequence of digits, then the
condition is satisfied if the capturing subpattern of that
number has previously matched. Consider the following pat-
tern, which contains non-significant white space to make it
number has previously matched. Consider the following pattern,
which contains non-significant white space to make it
more readable (assume the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option) and to
divide it into three parts for ease of discussion:
@ -1519,9 +1520,9 @@
If the condition is not a sequence of digits, it must be an
assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again contain-
ing non-significant white space, and with the two alterna-
tives on the second line:
lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives on
the second line:
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
@ -1563,7 +1564,8 @@
expressions to recurse (amongst other things). The special
item (?R) is provided for the specific case of recursion.
This PCRE pattern solves the parentheses problem (assume
the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option is set so that white space is
the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>
option is set so that white space is
ignored):
\( ( (?>[^()]+) | (?R) )* \)
@ -1575,15 +1577,15 @@
a closing parenthesis.
This particular example pattern contains nested unlimited
repeats, and so the use of a once-only subpattern for match-
ing strings of non-parentheses is important when applying
repeats, and so the use of a once-only subpattern for matching
strings of non-parentheses is important when applying
the pattern to strings that do not match. For example, when
it is applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
it yields "no match" quickly. However, if a once-only sub-
pattern is not used, the match runs for a very long time
it yields "no match" quickly. However, if a once-only subpattern
is not used, the match runs for a very long time
indeed because there are so many different ways the + and *
repeats can carve up the subject, and all have to be tested
before failure can be reported.
@ -1656,8 +1658,8 @@
repeat can match 0, 1, 2, 3, or 4 times, and for each of
those cases other than 0, the + repeats can match different
numbers of times.) When the remainder of the pattern is such
that the entire match is going to fail, PCRE has in princi-
ple to try every possible variation, and this can take an
that the entire match is going to fail, PCRE has in principle
to try every possible variation, and this can take an
extremely long time.
An optimization catches some of the more simple cases such