mirror of
https://github.com/sigmasternchen/php-doc-en
synced 2025-03-16 00:48:54 +00:00
white space and some spelling
git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@91117 c90b9560-bf6c-de11-be94-00142212c4b1
parent 505aacbeef
commit 931ba4788e
1 changed file with 172 additions and 170 deletions
|
@ -1,5 +1,5 @@
|
|||
<?xml version="1.0" encoding="iso-8859-1"?>
|
||||
<!-- $Revision: 1.4 $ -->
|
||||
<!-- $Revision: 1.5 $ -->
|
||||
<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
|
||||
<refentry id="pcre.pattern.syntax">
|
||||
<refnamediv>
|
||||
|
@ -25,8 +25,8 @@
|
|||
<listitem>
|
||||
<simpara>
|
||||
By default, a whitespace character is any character that
|
||||
the C library function isspace() recognizes, though it is
|
||||
possible to compile PCRE with alternative character type
|
||||
the C library function isspace() recognizes, though it is
|
||||
possible to compile PCRE with alternative character type
|
||||
tables. Normally isspace() matches space, formfeed, newline,
|
||||
carriage return, horizontal tab, and vertical tab. Perl 5 no
|
||||
longer includes vertical tab in its set of whitespace characters.
|
||||
|
@ -38,19 +38,19 @@
|
|||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
PCRE does not allow repeat quantifiers on lookahead
|
||||
PCRE does not allow repeat quantifiers on lookahead
|
||||
assertions. Perl permits them, but they do not mean what you
|
||||
might think. For example, (?!a){3} does not assert that the
|
||||
next three characters are not "a". It just asserts that the
|
||||
next three characters are not "a". It just asserts that the
|
||||
next character is not "a" three times.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Capturing subpatterns that occur inside negative looka-
|
||||
head assertions are counted, but their entries in the
|
||||
offsets vector are never set. Perl sets its numerical vari-
|
||||
ables from any such patterns that are matched before the
|
||||
Capturing subpatterns that occur inside negative looka-
|
||||
head assertions are counted, but their entries in the
|
||||
offsets vector are never set. Perl sets its numerical vari-
|
||||
ables from any such patterns that are matched before the
|
||||
assertion fails to match something (thereby succeeding), but
|
||||
only if the negative lookahead assertion contains just one
|
||||
branch.
|
||||
|
@ -59,8 +59,8 @@
|
|||
<listitem>
|
||||
<simpara>
|
||||
Though binary zero characters are supported in the subject string,
|
||||
they are not allowed in a pattern string because it is passed as a
|
||||
normal C string, terminated by zero. The escape sequence "\\x00" can
|
||||
they are not allowed in a pattern string because it is passed as a
|
||||
normal C string, terminated by zero. The escape sequence "\\x00" can
|
||||
be used in the pattern to represent a binary zero.
|
||||
</simpara>
|
||||
</listitem>
|
||||
|
@ -80,7 +80,7 @@
|
|||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Fairly obviously, PCRE does not support the (?{code})
|
||||
Fairly obviously, PCRE does not support the (?{code})
|
||||
construction.
|
||||
</simpara>
|
||||
</listitem>
|
||||
|
@ -181,7 +181,7 @@
|
|||
<para>
|
||||
There are two different sets of meta-characters: those that
|
||||
are recognized anywhere in the pattern except within square
|
||||
brackets, and those that are recognized in square brackets.
|
||||
brackets, and those that are recognized in square brackets.
|
||||
Outside square brackets, the meta-characters are as follows:
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
|
@ -196,7 +196,7 @@
|
|||
<term><emphasis>^</emphasis></term>
|
||||
<listitem>
|
||||
<simpara>
|
||||
assert start of subject (or line, in multiline mode)
|
||||
assert start of subject (or line, in multiline mode)
|
||||
</simpara>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
@ -298,8 +298,8 @@
|
|||
</varlistentry>
|
||||
</variablelist>
|
||||
|
||||
Part of a pattern that is in square brackets is called a
|
||||
"character class". In a character class the only meta-
|
||||
Part of a pattern that is in square brackets is called a
|
||||
"character class". In a character class the only meta-
|
||||
characters are:
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
|
@ -335,7 +335,7 @@
|
|||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
The following sections describe the use of each of the
|
||||
The following sections describe the use of each of the
|
||||
meta-characters.
|
||||
</para>
|
||||
</refsect2>
|
||||
|
@ -343,16 +343,16 @@
|
|||
<title>backslash</title>
|
||||
<para>
|
||||
The backslash character has several uses. Firstly, if it is
|
||||
followed by a non-alphameric character, it takes away any
|
||||
special meaning that character may have. This use of
|
||||
backslash as an escape character applies both inside and
|
||||
followed by a non-alphanumeric character, it takes away any
|
||||
special meaning that character may have. This use of
|
||||
backslash as an escape character applies both inside and
|
||||
outside character classes.
|
||||
</para>
|
||||
<para>
|
||||
For example, if you want to match a "*" character, you write
|
||||
"\*" in the pattern. This applies whether or not the follow-
|
||||
ing character would otherwise be interpreted as a meta-
|
||||
character, so it is always safe to precede a non-alphameric
|
||||
ing character would otherwise be interpreted as a meta-
|
||||
character, so it is always safe to precede a non-alphanumeric
|
||||
with "\" to specify that it stands for itself. In particu-
|
||||
lar, if you want to match a backslash, you write "\\".
|
||||
</para>
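A minimal sketch of the escaping rule described above. It uses Python's `re` module as a stand-in for PCRE, on the assumption that the two engines agree for these simple cases; the variable names are illustrative only.

```python
import re

# Unescaped, "*" is a quantifier: "a*b" means "zero or more a, then b",
# so it cannot match the three-character string "a*b".
unescaped = re.fullmatch(r"a*b", "a*b")

# Preceded by a backslash, "*" stands for itself.
escaped = re.fullmatch(r"a\*b", "a*b")

# A literal backslash is written as two backslashes in the pattern.
backslash = re.fullmatch(r"\\", "\\")

print(unescaped is None, escaped is not None, backslash is not None)
```

Raw string literals (`r"..."`) keep the host language from consuming the backslashes before the regex engine sees them.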
|
||||
|
@ -365,11 +365,11 @@
|
|||
of the pattern.
|
||||
</para>
|
||||
<para>
|
||||
A second use of backslash provides a way of encoding non-
|
||||
printing characters in patterns in a visible manner. There
|
||||
is no restriction on the appearance of non-printing charac-
|
||||
ters, apart from the binary zero that terminates a pattern,
|
||||
but when a pattern is being prepared by text editing, it is
|
||||
A second use of backslash provides a way of encoding non-
|
||||
printing characters in patterns in a visible manner. There
|
||||
is no restriction on the appearance of non-printing characters,
|
||||
apart from the binary zero that terminates a pattern,
|
||||
but when a pattern is being prepared by text editing, it is
|
||||
usually easier to use one of the following escape sequences
|
||||
than the binary character it represents:
|
||||
</para>
|
||||
|
@ -450,38 +450,41 @@
|
|||
</variablelist>
|
||||
</para>
|
||||
<para>
|
||||
The precise effect of "<literal>\cx</literal>" is as follows: if "<literal>x</literal>" is a lower
|
||||
case letter, it is converted to upper case. Then bit 6 of
|
||||
the character (hex 40) is inverted. Thus "<literal>\cz</literal>" becomes hex
|
||||
1A, but "<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>" becomes hex 7B.
|
||||
The precise effect of "<literal>\cx</literal>" is as follows:
|
||||
if "<literal>x</literal>" is a lower case letter, it is converted
|
||||
to upper case. Then bit 6 of the character (hex 40) is inverted.
|
||||
Thus "<literal>\cz</literal>" becomes hex 1A, but
|
||||
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
|
||||
becomes hex 7B.
|
||||
</para>
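The arithmetic behind "\cx" can be checked with a few lines of code. This is only a sketch of the rule as stated in the paragraph above, not PCRE's implementation; `control_escape` is a hypothetical helper name.

```python
def control_escape(x: str) -> int:
    """Return the byte value that "\\cx" stands for, per the rule above."""
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 (hex 40) of the character is inverted.
    return ord(x) ^ 0x40


for ch in ("z", "{", ";"):
    print(ch, hex(control_escape(ch)))
```

The three inputs are the ones used in the text, so the results can be compared directly with the hex values given there.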
|
||||
<para>
|
||||
After "<literal>\x</literal>", up to two hexadecimal digits are read (letters
|
||||
can be in upper or lower case).
|
||||
After "<literal>\x</literal>", up to two hexadecimal digits are
|
||||
read (letters can be in upper or lower case).
|
||||
</para>
|
||||
<para>
|
||||
After "<literal>\0</literal>" up to two further octal digits are read. In both
|
||||
cases, if there are fewer than two digits, just those that
|
||||
are present are used. Thus the sequence "<literal>\0\x\07</literal>" specifies
|
||||
two binary zeros followed by a BEL character. Make sure you
|
||||
supply two digits after the initial zero if the character
|
||||
After "<literal>\0</literal>" up to two further octal digits are read.
|
||||
In both cases, if there are fewer than two digits, just those that
|
||||
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
|
||||
specifies two binary zeros followed by a BEL character. Make sure you
|
||||
supply two digits after the initial zero if the character
|
||||
that follows is itself an octal digit.
|
||||
</para>
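The warning about supplying two digits after the initial zero can be illustrated as follows. The sketch uses Python's `re` module, which is assumed here to read "\0" plus up to two further octal digits in the same way as described above. Python's "\x" escape differs from PCRE (it requires exactly two hex digits), so the "\x" part of the example in the text is left out.

```python
import re

# "\0" alone is a binary zero; "\07" is BEL (octal 7).
nul_then_bel = re.fullmatch(r"\0\07", "\x00\x07")

# Pitfall: "\07" is a single BEL character, not a binary zero followed by "7".
not_nul_seven = re.fullmatch(r"\07", "\x00" + "7")

# Writing the zero with two further octal digits ("\000") keeps the
# following "7" from being absorbed into the escape.
nul_then_seven = re.fullmatch(r"\0007", "\x00" + "7")

print(nul_then_bel is not None, not_nul_seven is None, nul_then_seven is not None)
```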
|
||||
<para>
|
||||
The handling of a backslash followed by a digit other than 0
|
||||
is complicated. Outside a character class, PCRE reads it
|
||||
is complicated. Outside a character class, PCRE reads it
|
||||
and any following digits as a decimal number. If the number
|
||||
is less than 10, or if there have been at least that many
|
||||
previous capturing left parentheses in the expression, the
|
||||
entire sequence is taken as a <emphasis>back</emphasis> <emphasis>reference</emphasis>. A description
|
||||
entire sequence is taken as a <emphasis>back</emphasis>
|
||||
<emphasis>reference</emphasis>. A description
|
||||
of how this works is given later, following the discussion
|
||||
of parenthesized subpatterns.
|
||||
</para>
|
||||
<para>
|
||||
Inside a character class, or if the decimal number is
|
||||
greater than 9 and there have not been that many capturing
|
||||
subpatterns, PCRE re-reads up to three octal digits follow-
|
||||
ing the backslash, and generates a single byte from the
|
||||
greater than 9 and there have not been that many capturing
|
||||
subpatterns, PCRE re-reads up to three octal digits following
|
||||
the backslash, and generates a single byte from the
|
||||
least significant 8 bits of the value. Any subsequent digits
|
||||
stand for themselves. For example:
|
||||
</para>
|
||||
|
@ -566,15 +569,15 @@
|
|||
</variablelist>
|
||||
</para>
|
||||
<para>
|
||||
Note that octal values of 100 or greater must not be intro-
|
||||
duced by a leading zero, because no more than three octal
|
||||
Note that octal values of 100 or greater must not be intro-
|
||||
duced by a leading zero, because no more than three octal
|
||||
digits are ever read.
|
||||
</para>
|
||||
<para>
|
||||
All the sequences that define a single byte value can be
|
||||
All the sequences that define a single byte value can be
|
||||
used both inside and outside character classes. In addition,
|
||||
inside a character class, the sequence "<literal>\b</literal>" is interpreted
|
||||
as the backspace character (hex 08). Outside a character
|
||||
inside a character class, the sequence "<literal>\b</literal>"
|
||||
is interpreted as the backspace character (hex 08). Outside a character
|
||||
class it has a different meaning (see below).
|
||||
</para>
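The two meanings of "\b" can be seen side by side in a short sketch. Python's `re` module is used as a stand-in for PCRE here; its documentation describes the same distinction, but treating the two engines as equivalent is an assumption.

```python
import re

# Inside a character class, \b is the backspace character (hex 08).
backspace = re.fullmatch(r"[\b]", "\x08")

# Outside a character class, \b is a zero-width word boundary assertion.
whole_word = re.search(r"\bcat\b", "a cat sat")
inside_word = re.search(r"\bcat\b", "concatenate")

print(backspace is not None, whole_word is not None, inside_word is None)
```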
|
||||
<para>
|
||||
|
@ -635,32 +638,32 @@
|
|||
</para>
|
||||
<para>
|
||||
Each pair of escape sequences partitions the complete set of
|
||||
characters into two disjoint sets. Any given character
|
||||
characters into two disjoint sets. Any given character
|
||||
matches one, and only one, of each pair.
|
||||
</para>
|
||||
<para>
|
||||
A "word" character is any letter or digit or the underscore
|
||||
A "word" character is any letter or digit or the underscore
|
||||
character, that is, any character which can be part of a
|
||||
Perl "<literal>word</literal>". The definition of letters and digits is
|
||||
controlled by PCRE's character tables, and may vary if locale-specific
|
||||
matching is taking place (see "Locale support"
|
||||
controlled by PCRE's character tables, and may vary if locale-specific
|
||||
matching is taking place (see "Locale support"
|
||||
above). For example, in the "fr" (French) locale, some char-
|
||||
acter codes greater than 128 are used for accented letters,
|
||||
acter codes greater than 128 are used for accented letters,
|
||||
and these are matched by <literal>\w</literal>.
|
||||
</para>
|
||||
<para>
|
||||
These character type sequences can appear both inside and
|
||||
These character type sequences can appear both inside and
|
||||
outside character classes. They each match one character of
|
||||
the appropriate type. If the current matching point is at
|
||||
the appropriate type. If the current matching point is at
|
||||
the end of the subject string, all of them fail, since there
|
||||
is no character to match.
|
||||
</para>
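The two properties stated above (each pair partitions the character set, and every type sequence fails at the end of the subject) can be checked with a small sketch. Python's `re` module stands in for PCRE, and the sample characters are arbitrary.

```python
import re

samples = ["a", "Z", "7", "_", " ", "-", "\n"]

# For each character, record whether it matches \w and whether it matches \W.
partition = {
    ch: (bool(re.fullmatch(r"\w", ch)), bool(re.fullmatch(r"\W", ch)))
    for ch in samples
}

# Start matching at the end of the subject: there is no character left,
# so both members of the pair fail.
word_at_end = re.compile(r"\w").match("abc", 3)
non_word_at_end = re.compile(r"\W").match("abc", 3)

print(partition)
print(word_at_end, non_word_at_end)
```

Exactly one of the two flags is true for every sample, including the underscore, which counts as a "word" character.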
|
||||
<para>
|
||||
The fourth use of backslash is for certain simple asser-
|
||||
tions. An assertion specifies a condition that has to be met
|
||||
at a particular point in a match, without consuming any
|
||||
characters from the subject string. The use of subpatterns
|
||||
for more complicated assertions is described below. The
|
||||
at a particular point in a match, without consuming any
|
||||
characters from the subject string. The use of subpatterns
|
||||
for more complicated assertions is described below. The
|
||||
backslashed assertions are
|
||||
</para>
|
||||
<para>
|
||||
|
@ -693,7 +696,7 @@
|
|||
<term><emphasis>\Z</emphasis></term>
|
||||
<listitem>
|
||||
<simpara>
|
||||
end of subject or newline at end (independent of
|
||||
end of subject or newline at end (independent of
|
||||
multiline mode)
|
||||
</simpara>
|
||||
</listitem>
|
||||
|
@ -702,7 +705,7 @@
|
|||
<term><emphasis>\z</emphasis></term>
|
||||
<listitem>
|
||||
<simpara>
|
||||
end of subject (independent of multiline mode)
|
||||
end of subject (independent of multiline mode)
|
||||
</simpara>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
@ -714,20 +717,23 @@
|
|||
character, inside a character class).
|
||||
</para>
|
||||
<para>
|
||||
A word boundary is a position in the subject string where
|
||||
A word boundary is a position in the subject string where
|
||||
the current character and the previous character do not both
|
||||
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
|
||||
<literal>\w</literal> and the other matches
|
||||
<literal>\W</literal>), or the start or end of the string if the first or last
|
||||
character matches \w, respectively.
|
||||
<literal>\W</literal>), or the start or end of the string if the first
|
||||
or last character matches \w, respectively.
|
||||
</para>
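To make the definition of a word boundary concrete, the following sketch lists every position at which "\b" matches. It assumes that Python's `re` module follows the same definition as PCRE for plain ASCII text; `boundaries` is a hypothetical helper.

```python
import re


def boundaries(subject: str) -> list[int]:
    """Return the offsets in subject at which \\b matches."""
    return [m.start() for m in re.finditer(r"\b", subject)]


# Boundaries fall at the start and end of each word.
print(boundaries("ab cd"))

# The start and end of the string only count when the first or last
# character is a word character, so there is no boundary here at all.
print(boundaries("!!"))
```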
|
||||
<para>
|
||||
The <literal>\A</literal>, <literal>\Z</literal>, and <literal>\z</literal> assertions differ from the traditional
|
||||
The <literal>\A</literal>, <literal>\Z</literal>, and
|
||||
<literal>\z</literal> assertions differ from the traditional
|
||||
circumflex and dollar (described below) in that they only
|
||||
ever match at the very start and end of the subject string,
|
||||
whatever options are set. They are not affected by the
|
||||
<link linkend="pcre.pattern.modifiers">PCRE_NOTBOL</link> or <link linkend="pcre.pattern.modifiers">PCRE_NOTEOL</link> options. The difference between
|
||||
<literal>\Z</literal> and <literal>\z</literal> is that <literal>\Z</literal>
|
||||
<link linkend="pcre.pattern.modifiers">PCRE_NOTBOL</link> or
|
||||
<link linkend="pcre.pattern.modifiers">PCRE_NOTEOL</link> options.
|
||||
The difference between <literal>\Z</literal> and
|
||||
<literal>\z</literal> is that <literal>\Z</literal>
|
||||
matches before a newline that is the
|
||||
last character of the string as well as at the end of the
|
||||
string, whereas <literal>\z</literal> matches only at the end.
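The difference between matching before a final newline and matching only at the very end can be sketched in Python, with one important caveat: the assertion names do not line up. Python's "\Z" matches only at the very end of the string, which corresponds to "\z" in the text above, while Python's "$" without the multiline flag behaves like the "\Z" described here. That mapping is an assumption of this sketch, not part of the PCRE documentation.

```python
import re

subject = "foo\n"

# Python "$" (no MULTILINE): end of string, or just before a final newline.
# This is the behaviour the text attributes to \Z.
before_final_newline = re.search(r"foo$", subject)

# Python "\Z": the very end of the string only.
# This is the behaviour the text attributes to \z.
very_end_only = re.search(r"foo\Z", subject)
very_end_with_newline = re.search(r"foo\n\Z", subject)

print(before_final_newline is not None,
      very_end_only is None,
      very_end_with_newline is not None)
```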
|
||||
|
@ -744,7 +750,7 @@
|
|||
different meaning (see below).
|
||||
|
||||
Circumflex need not be the first character of the pattern if
|
||||
a number of alternatives are involved, but it should be the
|
||||
a number of alternatives are involved, but it should be the
|
||||
first thing in each alternative in which it appears if the
|
||||
pattern is ever to match that branch. If all possible alter-
|
||||
natives start with a circumflex, that is, if the pattern is
|
||||
|
@ -763,7 +769,8 @@
|
|||
|
||||
The meaning of dollar can be changed so that it matches only
|
||||
at the very end of the string, by setting the
|
||||
<link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> option at compile or matching time. This
|
||||
<link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
|
||||
option at compile or matching time. This
|
||||
does not affect the \Z assertion.
|
||||
|
||||
The meanings of the circumflex and dollar characters are
|
||||
|
@ -873,7 +880,7 @@
|
|||
For example, the class [^\W_] matches any letter or digit,
|
||||
but not underscore.
|
||||
|
||||
All non-alphameric characters other than \, -, ^ (at the
|
||||
All non-alphanumeric characters other than \, -, ^ (at the
|
||||
start) and the terminating ] are non-special in character
|
||||
classes, but it does no harm if they are escaped.
|
||||
</literallayout>
|
||||
|
@ -887,8 +894,8 @@
|
|||
|
||||
gilbert|sullivan
|
||||
|
||||
matches either "gilbert" or "sullivan". Any number of alter-
|
||||
natives may appear, and an empty alternative is permitted
|
||||
matches either "gilbert" or "sullivan". Any number of alternatives
|
||||
may appear, and an empty alternative is permitted
|
||||
(matching the empty string). The matching process tries
|
||||
each alternative in turn, from left to right, and the first
|
||||
one that succeeds is used. If the alternatives are within a
|
||||
|
@ -933,11 +940,11 @@
|
|||
abc(?i)
|
||||
|
||||
which in turn is the same as compiling the pattern abc with
|
||||
<link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> set. In other words, such "top level" set-
|
||||
tings apply to the whole pattern (unless there are other
|
||||
changes inside subpatterns). If there is more than one set-
|
||||
ting of the same option at top level, the rightmost setting
|
||||
is used.
|
||||
<link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> set.
|
||||
In other words, such "top level" settings apply to the whole
|
||||
pattern (unless there are other changes inside subpatterns).
|
||||
If there is more than one setting of the same option at top level,
|
||||
the rightmost setting is used.
|
||||
|
||||
If an option change occurs inside a subpattern, the effect
|
||||
is different. This is a change of behaviour in Perl 5.005.
|
||||
|
@ -958,8 +965,7 @@
|
|||
matches "ab", "aB", "c", and "C", even though when matching
|
||||
"C" the first branch is abandoned before the option setting.
|
||||
This is because the effects of option settings happen at
|
||||
compile time. There would be some very weird behaviour oth-
|
||||
erwise.
|
||||
compile time. There would be some very weird behaviour otherwise.
|
||||
|
||||
The PCRE-specific options <link linkend="pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
|
||||
<link linkend="pcre.pattern.modifiers">PCRE_EXTRA</link> can
|
||||
|
@ -975,25 +981,26 @@
|
|||
<title>subpatterns</title>
|
||||
<literallayout>
|
||||
Subpatterns are delimited by parentheses (round brackets),
|
||||
which can be nested. Marking part of a pattern as a subpat-
|
||||
tern does two things:
|
||||
which can be nested. Marking part of a pattern as a subpattern
|
||||
does two things:
|
||||
|
||||
1. It localizes a set of alternatives. For example, the pat-
|
||||
tern
|
||||
|
||||
cat(aract|erpillar|)
|
||||
|
||||
matches one of the words "cat", "cataract", or "caterpil-
|
||||
lar". Without the parentheses, it would match "cataract",
|
||||
matches one of the words "cat", "cataract", or "caterpillar".
|
||||
Without the parentheses, it would match "cataract",
|
||||
"erpillar" or the empty string.
|
||||
|
||||
2. It sets up the subpattern as a capturing subpattern (as
|
||||
defined above). When the whole pattern matches, that por-
|
||||
tion of the subject string that matched the subpattern is
|
||||
passed back to the caller via the <emphasis>ovector</emphasis> argument of
|
||||
<function>pcre_exec</function>. Opening parentheses are counted from left to
|
||||
right (starting from 1) to obtain the numbers of the captur-
|
||||
ing subpatterns.
|
||||
defined above). When the whole pattern matches, that portion
|
||||
of the subject string that matched the subpattern is
|
||||
passed back to the caller via the <emphasis>ovector</emphasis>
|
||||
argument of
|
||||
<function>pcre_exec</function>. Opening parentheses are counted
|
||||
from left to right (starting from 1) to obtain the numbers of the
|
||||
capturing subpatterns.
|
||||
|
||||
For example, if the string "the red king" is matched against
|
||||
the pattern
|
||||
|
@ -1004,8 +1011,8 @@
|
|||
and are numbered 1, 2, and 3.
|
||||
|
||||
The fact that plain parentheses fulfil two functions is not
|
||||
always helpful. There are often times when a grouping sub-
|
||||
pattern is required without a capturing requirement. If an
|
||||
always helpful. There are often times when a grouping subpattern
|
||||
is required without a capturing requirement. If an
|
||||
opening parenthesis is followed by "?:", the subpattern does
|
||||
not do any capturing, and is not counted when computing the
|
||||
number of any subsequent capturing subpatterns. For example,
|
||||
|
@ -1015,8 +1022,8 @@
|
|||
the ((?:red|white) (king|queen))
|
||||
|
||||
the captured substrings are "white queen" and "queen", and
|
||||
are numbered 1 and 2. The maximum number of captured sub-
|
||||
strings is 99, and the maximum number of all subpatterns,
|
||||
are numbered 1 and 2. The maximum number of captured substrings
|
||||
is 99, and the maximum number of all subpatterns,
|
||||
both capturing and non-capturing, is 200.
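The numbering of captured substrings can be verified with a short sketch. Python's `re` module is used in place of PCRE, on the assumption that group numbering works the same way. The fully capturing variant of the pattern is included only for comparison.

```python
import re

subject = "the white queen"

# "?:" turns the colour group into a pure grouping construct,
# so only two substrings are captured.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))", subject)

# With plain parentheses throughout, the colour group is counted as well.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", subject)

print(non_capturing.groups())
print(capturing.groups())
```

The non-capturing form shifts "queen" from group 3 to group 2, which matters for any back references or replacement strings that refer to groups by number.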
|
||||
|
||||
As a convenient shorthand, if any option settings are
|
||||
|
@ -1072,8 +1079,8 @@
|
|||
matches exactly 8 digits. An opening curly bracket that
|
||||
appears in a position where a quantifier is not allowed, or
|
||||
one that does not match the syntax of a quantifier, is taken
|
||||
as a literal character. For example, {,6} is not a quantif-
|
||||
ier, but a literal string of four characters.
|
||||
as a literal character. For example, {,6} is not a quantifier,
|
||||
but a literal string of four characters.
|
||||
|
||||
The quantifier {0} is permitted, causing the expression to
|
||||
behave as if the previous item and the quantifier were not
|
||||
|
@ -1099,13 +1106,13 @@
|
|||
fact match no characters, the loop is forcibly broken.
|
||||
|
||||
By default, the quantifiers are "greedy", that is, they
|
||||
match as much as possible (up to the maximum number of per-
|
||||
mitted times), without causing the rest of the pattern to
|
||||
match as much as possible (up to the maximum number of permitted
|
||||
times), without causing the rest of the pattern to
|
||||
fail. The classic example of where this gives problems is in
|
||||
trying to match comments in C programs. These appear between
|
||||
the sequences /* and */ and within the sequence, individual
|
||||
* and / characters may appear. An attempt to match C com-
|
||||
ments by applying the pattern
|
||||
* and / characters may appear. An attempt to match C comments
|
||||
by applying the pattern
|
||||
|
||||
/\*.*\*/
|
||||
|
||||
|
@ -1123,8 +1130,8 @@
|
|||
/\*.*?\*/
|
||||
|
||||
does the right thing with the C comments. The meaning of the
|
||||
various quantifiers is not otherwise changed, just the pre-
|
||||
ferred number of matches. Do not confuse this use of ques-
|
||||
various quantifiers is not otherwise changed, just the preferred
|
||||
number of matches. Do not confuse this use of ques-
|
||||
tion mark with its use as a quantifier in its own right.
|
||||
Because it has two uses, it can sometimes appear doubled, as
|
||||
in
|
||||
|
@ -1141,33 +1148,32 @@
|
|||
default behaviour.
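The C comment example discussed above can be run directly. This sketch uses Python's `re` module, whose greedy and non-greedy quantifiers are assumed to behave like PCRE's for this pattern; the sample source line is made up for illustration.

```python
import re

source = "/* first */ x = 1; /* second */"

# Greedy: .* takes as much as possible, so the match runs from the first
# "/*" to the last "*/" and swallows the code in between.
greedy = re.search(r"/\*.*\*/", source).group(0)

# Non-greedy: .*? takes as little as possible and stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", source).group(0)

# Repeated non-greedy matching finds each comment separately.
all_comments = re.findall(r"/\*.*?\*/", source)

print(greedy)
print(lazy)
print(all_comments)
```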
|
||||
|
||||
When a parenthesized subpattern is quantified with a minimum
|
||||
repeat count that is greater than 1 or with a limited max-
|
||||
imum, more store is required for the compiled pattern, in
|
||||
repeat count that is greater than 1 or with a limited maximum,
|
||||
more store is required for the compiled pattern, in
|
||||
proportion to the size of the minimum or maximum.
|
||||
|
||||
If a pattern starts with .* or .{0,} and the <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
|
||||
option (equivalent to Perl's /s) is set, thus allowing the .
|
||||
to match newlines, then the pattern is implicitly anchored,
|
||||
because whatever follows will be tried against every charac-
|
||||
ter position in the subject string, so there is no point in
|
||||
because whatever follows will be tried against every character
|
||||
position in the subject string, so there is no point in
|
||||
retrying the overall match at any position after the first.
|
||||
PCRE treats such a pattern as though it were preceded by \A.
|
||||
In cases where it is known that the subject string contains
|
||||
no newlines, it is worth setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> when the pat-
|
||||
tern begins with .* in order to obtain this optimization, or
|
||||
no newlines, it is worth setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> when the pattern begins with .* in order to
|
||||
obtain this optimization, or
|
||||
alternatively using ^ to indicate anchoring explicitly.
|
||||
|
||||
When a capturing subpattern is repeated, the value captured
|
||||
is the substring that matched the final iteration. For exam-
|
||||
ple, after
|
||||
is the substring that matched the final iteration. For example, after
|
||||
|
||||
(tweedle[dume]{3}\s*)+
|
||||
|
||||
has matched "tweedledum tweedledee" the value of the cap-
|
||||
tured substring is "tweedledee". However, if there are
|
||||
has matched "tweedledum tweedledee" the value of the captured
|
||||
substring is "tweedledee". However, if there are
|
||||
nested capturing subpatterns, the corresponding captured
|
||||
values may have been set in previous iterations. For exam-
|
||||
ple, after
|
||||
values may have been set in previous iterations. For example,
|
||||
after
|
||||
|
||||
/(a|(b))+/
|
||||
|
||||
|
@ -1191,28 +1197,27 @@
|
|||
left parentheses in the entire pattern. In other words, the
|
||||
parentheses that are referenced need not be to the left of
|
||||
the reference for numbers less than 10. See the section
|
||||
entitled "Backslash" above for further details of the han-
|
||||
dling of digits following a backslash.
|
||||
entitled "Backslash" above for further details of the handling
|
||||
of digits following a backslash.
|
||||
|
||||
A back reference matches whatever actually matched the cap-
|
||||
turing subpattern in the current subject string, rather than
|
||||
A back reference matches whatever actually matched the capturing
|
||||
subpattern in the current subject string, rather than
|
||||
anything matching the subpattern itself. So the pattern
|
||||
|
||||
(sens|respons)e and \1ibility
|
||||
|
||||
matches "sense and sensibility" and "response and responsi-
|
||||
bility", but not "sense and responsibility". If caseful
|
||||
matches "sense and sensibility" and "response and responsibility",
|
||||
but not "sense and responsibility". If caseful
|
||||
matching is in force at the time of the back reference, then
|
||||
the case of letters is relevant. For example,
|
||||
|
||||
((?i)rah)\s+\1
|
||||
|
||||
matches "rah rah" and "RAH RAH", but not "RAH rah", even
|
||||
though the original capturing subpattern is matched case-
|
||||
lessly.
|
||||
though the original capturing subpattern is matched caselessly.
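The "sense and sensibility" pattern given above can be exercised as follows. The sketch uses Python's `re` module as a stand-in for PCRE. The caseless "rah" example is omitted because recent Python versions do not accept an option setting in that position.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat the text that group 1 actually matched in this subject,
# not merely something that the subpattern could match.
results = {s: bool(pattern.fullmatch(s)) for s in subjects}

print(results)
```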
|
||||
|
||||
There may be more than one back reference to the same sub-
|
||||
pattern. If a subpattern has not actually been used in a
|
||||
There may be more than one back reference to the same subpattern.
|
||||
If a subpattern has not actually been used in a
|
||||
particular match, then any back references to it always
|
||||
fail. For example, the pattern
|
||||
|
||||
|
@ -1229,15 +1234,14 @@
|
|||
A back reference that occurs inside the parentheses to which
|
||||
it refers fails when the subpattern is first used, so, for
|
||||
example, (a\1) never matches. However, such references can
|
||||
be useful inside repeated subpatterns. For example, the pat-
|
||||
tern
|
||||
be useful inside repeated subpatterns. For example, the pattern
|
||||
|
||||
(a|b\1)+
|
||||
|
||||
matches any number of "a"s and also "aba", "ababaa" etc. At
|
||||
each iteration of the subpattern, the back reference matches
|
||||
the character string corresponding to the previous itera-
|
||||
tion. In order for this to work, the pattern must be such
|
||||
the character string corresponding to the previous iteration.
|
||||
In order for this to work, the pattern must be such
|
||||
that the first iteration does not need to match the back
|
||||
reference. This can be done using alternation, as in the
|
||||
example above, or by a quantifier with a minimum of zero.
|
||||
|
@ -1250,8 +1254,8 @@
|
|||
An assertion is a test on the characters following or
|
||||
preceding the current matching point that does not actually
|
||||
consume any characters. The simple assertions coded as \b,
|
||||
\B, \A, \Z, \z, ^ and $ are described above. More compli-
|
||||
cated assertions are coded as subpatterns. There are two
|
||||
\B, \A, \Z, \z, ^ and $ are described above. More complicated
|
||||
assertions are coded as subpatterns. There are two
|
||||
kinds: those that look ahead of the current position in the
|
||||
subject string, and those that look behind it.
|
||||
|
||||
|
@ -1278,8 +1282,8 @@
|
|||
when the next three characters are "bar". A lookbehind
|
||||
assertion is needed to achieve this effect.
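The pitfall described above is easy to reproduce. The sketch below contrasts the negative lookahead with the negative lookbehind form, (?<!foo)bar, which the text introduces next. Python's `re` module is used in place of PCRE, on the assumption that both engines treat these fixed-width assertions alike.

```python
import re

# The lookahead is tested at the position where "bar" starts. The next
# three characters are "bar", not "foo", so the assertion always succeeds.
lookahead = re.search(r"(?!foo)bar", "foobar")

# The lookbehind inspects the three characters before "bar" instead.
lookbehind_after_foo = re.search(r"(?<!foo)bar", "foobar")
lookbehind_after_baz = re.search(r"(?<!foo)bar", "bazbar")

print(lookahead is not None,
      lookbehind_after_foo is None,
      lookbehind_after_baz is not None)
```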
|
||||
|
||||
Lookbehind assertions start with (?<= for positive asser-
|
||||
tions and (?<! for negative assertions. For example,
|
||||
Lookbehind assertions start with (?<= for positive assertions
|
||||
and (?<! for negative assertions. For example,
|
||||
|
||||
(?<!foo)bar
|
||||
|
||||
|
@ -1295,8 +1299,8 @@
|
|||
|
||||
(?<!dogs?|cats?)
|
||||
|
||||
causes an error at compile time. Branches that match dif-
|
||||
ferent length strings are permitted only at the top level of
|
||||
causes an error at compile time. Branches that match different
|
||||
length strings are permitted only at the top level of
|
||||
a lookbehind assertion. This is an extension compared with
|
||||
Perl 5.005, which requires all branches to match the same
|
||||
length of string. An assertion such as
|
||||
|
@ -1304,8 +1308,8 @@
|
|||
(?<=ab(c|de))
|
||||
|
||||
is not permitted, because its single top-level branch can
|
||||
match two different lengths, but it is acceptable if rewrit-
|
||||
ten to use two top-level branches:
|
||||
match two different lengths, but it is acceptable if rewritten
|
||||
to use two top-level branches:
|
||||
|
||||
(?<=abc|abde)
|
||||
|
||||
|
@ -1314,8 +1318,8 @@
|
|||
by the fixed width and then try to match. If there are
|
||||
insufficient characters before the current position, the
|
||||
match is deemed to fail. Lookbehinds in conjunction with
|
||||
once-only subpatterns can be particularly useful for match-
|
||||
ing at the ends of strings; an example is given at the end
|
||||
once-only subpatterns can be particularly useful for matching
|
||||
at the ends of strings; an example is given at the end
|
||||
of the section on once-only subpatterns.
|
||||
|
||||
Several assertions (of any sort) may occur in succession.
|
||||
@@ -1398,23 +1402,22 @@
 This kind of parenthesis "locks up" the part of the pattern
 it contains once it has matched, and a failure further into
 the pattern is prevented from backtracking into it. Back-
-tracking past it to previous items, however, works as nor-
-mal.
+tracking past it to previous items, however, works as normal.
 
 An alternative description is that a subpattern of this type
-matches the string of characters that an identical stan-
-dalone pattern would match, if anchored at the current point
+matches the string of characters that an identical standalone
+pattern would match, if anchored at the current point
 in the subject string.
 
 Once-only subpatterns are not capturing subpatterns. Simple
-cases such as the above example can be thought of as a max-
-imizing repeat that must swallow everything it can. So,
+cases such as the above example can be thought of as a maximizing
+repeat that must swallow everything it can. So,
 while both \d+ and \d+? are prepared to adjust the number of
 digits they match in order to make the rest of the pattern
 match, (?>\d+) can only match an entire sequence of digits.
 
-This construction can of course contain arbitrarily compli-
-cated subpatterns, and it can be nested.
+This construction can of course contain arbitrarily complicated
+subpatterns, and it can be nested.
 
 Once-only subpatterns can be used in conjunction with look-
 behind assertions to specify efficient matching at the end
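Reviewer note on the hunk above: the claim that (?>\d+) "can only match an entire sequence of digits" can be checked with a small sketch. It is written in Python rather than PHP, and it makes one loud assumption: because Python's re module only accepts (?>...) from version 3.11 on, the once-only subpattern is emulated with the lookahead-plus-backreference idiom (?=(\d+))\1, which is a stand-in and not the PCRE syntax documented here.

```python
import re

# Ordinary greedy repeat: \d+ first takes all five digits, then gives one
# back so that the trailing literal 5 can match.
greedy = re.compile(r"\d+(?:5)")

# Emulated once-only subpattern (assumption: this idiom stands in for
# (?>\d+)). The lookahead captures the longest run of digits, and the
# matcher cannot backtrack into a completed lookahead, so \1 must consume
# exactly that run and nothing is handed back to the trailing 5.
once_only = re.compile(r"(?=(\d+))\1(?:5)")

subject = "12345"
greedy_match = greedy.search(subject)
once_only_match = once_only.search(subject)

print(greedy_match.group(0) if greedy_match else None)
print(once_only_match)
```

The greedy pattern succeeds because it backtracks by one digit, while the emulated once-only form fails at every starting position, which is the behaviour the text describes for (?>\d+).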
@@ -1442,19 +1445,18 @@
 match only the entire string. The subsequent lookbehind
 assertion does a single test on the last four characters. If
 it fails, the match fails immediately. For long strings,
-this approach makes a significant difference to the process-
-ing time.
+this approach makes a significant difference to the processing time.
 
-When a pattern contains an unlimited repeat inside a subpat-
-tern that can itself be repeated an unlimited number of
+When a pattern contains an unlimited repeat inside a subpattern
+that can itself be repeated an unlimited number of
 times, the use of a once-only subpattern is the only way to
 avoid some failing matches taking a very long time indeed.
 The pattern
 
 (\D+|<\d+>)*[!?]
 
-matches an unlimited number of substrings that either con-
-sist of non-digits, or digits enclosed in <>, followed by
+matches an unlimited number of substrings that either consist
+of non-digits, or digits enclosed in <>, followed by
 either ! or ?. When it matches, it runs quickly. However, if
 it is applied to
 
@@ -1462,8 +1464,8 @@
 
 it takes a long time before reporting failure. This is
 because the string can be divided between the two repeats in
-a large number of ways, and all have to be tried. (The exam-
-ple used [!?] rather than a single character at the end,
+a large number of ways, and all have to be tried. (The example
+used [!?] rather than a single character at the end,
 because both PCRE and Perl have an optimization that allows
 for fast failure when a single character is used. They
 remember the last single character that is required for a
@@ -1472,16 +1474,15 @@
 
 ((?>\D+)|<\d+>)*[!?]
 
-sequences of non-digits cannot be broken, and failure hap-
-pens quickly.
+sequences of non-digits cannot be broken, and failure happens quickly.
 </literallayout>
 </refsect2>
 
 <refsect2 id="regexp.reference.conditional">
 <title>Conditional subpatterns</title>
 <literallayout>
-It is possible to cause the matching process to obey a sub-
-pattern conditionally or to choose between two alternative
+It is possible to cause the matching process to obey a subpattern
+conditionally or to choose between two alternative
 subpatterns, depending on the result of an assertion, or
 whether a previous capturing subpattern matched or not. The
 two possible forms of conditional subpattern are
@@ -1489,16 +1490,16 @@
 (?(condition)yes-pattern)
 (?(condition)yes-pattern|no-pattern)
 
-If the condition is satisfied, the yes-pattern is used; oth-
-erwise the no-pattern (if present) is used. If there are
+If the condition is satisfied, the yes-pattern is used; otherwise
+the no-pattern (if present) is used. If there are
 more than two alternatives in the subpattern, a compile-time
 error occurs.
 
 There are two kinds of condition. If the text between the
 parentheses consists of a sequence of digits, then the
 condition is satisfied if the capturing subpattern of that
-number has previously matched. Consider the following pat-
-tern, which contains non-significant white space to make it
+number has previously matched. Consider the following pattern,
+which contains non-significant white space to make it
 more readable (assume the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option) and to
 divide it into three parts for ease of discussion:
 
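Reviewer note on the hunk above: the digit form of the condition can be tried out with a short sketch. The three-part pattern below is an illustrative assumption, since the pattern the text refers to falls outside this hunk, and the sketch uses Python's re module with re.VERBOSE standing in for PCRE_EXTENDED. Python accepts the (?(1)yes-pattern) form with a group number but not the assertion form, so only the first kind of condition is shown.

```python
import re

# Illustrative pattern (an assumption, not copied from the diff):
# part 1 optionally captures an opening parenthesis, part 2 matches the
# body, and part 3 requires a closing parenthesis only when group 1
# actually took part in the match.
pattern = re.compile(
    r"""
    ( \( )?        # part 1: optional opening parenthesis, captured as group 1
    [^()]+         # part 2: one or more characters that are not parentheses
    (?(1) \) )     # part 3: if group 1 matched, require a closing parenthesis
    """,
    re.VERBOSE,
)

subjects = ["(abc)", "abc", "(abc", "abc)"]

# fullmatch is used so that an unbalanced parenthesis cannot be skipped over.
results = {s: pattern.fullmatch(s) is not None for s in subjects}
print(results)
```

Strings with balanced or absent parentheses are accepted, and the two unbalanced ones are rejected, because the condition only demands the closing parenthesis when the opening one was captured.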
@@ -1519,9 +1520,9 @@
 
 If the condition is not a sequence of digits, it must be an
 assertion. This may be a positive or negative lookahead or
-lookbehind assertion. Consider this pattern, again contain-
-ing non-significant white space, and with the two alterna-
-tives on the second line:
+lookbehind assertion. Consider this pattern, again containing
+non-significant white space, and with the two alternatives on
+the second line:
 
 (?(?=[^a-z]*[a-z])
 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
@@ -1563,7 +1564,8 @@
 expressions to recurse (amongst other things). The special
 item (?R) is provided for the specific case of recursion.
 This PCRE pattern solves the parentheses problem (assume
-the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option is set so that white space is
+the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>
+option is set so that white space is
 ignored):
 
 \( ( (?>[^()]+) | (?R) )* \)
@@ -1575,15 +1577,15 @@
 a closing parenthesis.
 
 This particular example pattern contains nested unlimited
-repeats, and so the use of a once-only subpattern for match-
-ing strings of non-parentheses is important when applying
+repeats, and so the use of a once-only subpattern for matching
+strings of non-parentheses is important when applying
 the pattern to strings that do not match. For example, when
 it is applied to
 
 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
 
-it yields "no match" quickly. However, if a once-only sub-
-pattern is not used, the match runs for a very long time
+it yields "no match" quickly. However, if a once-only subpattern
+is not used, the match runs for a very long time
 indeed because there are so many different ways the + and *
 repeats can carve up the subject, and all have to be tested
 before failure can be reported.
@@ -1656,8 +1658,8 @@
 repeat can match 0, 1, 2, 3, or 4 times, and for each of
 those cases other than 0, the + repeats can match different
 numbers of times.) When the remainder of the pattern is such
-that the entire match is going to fail, PCRE has in princi-
-ple to try every possible variation, and this can take an
+that the entire match is going to fail, PCRE has in principle
+to try every possible variation, and this can take an
 extremely long time.
 
 An optimization catches some of the more simple cases such