mirror of
https://github.com/sigmasternchen/php-doc-en
synced 2025-03-19 18:38:55 +00:00

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@147268 c90b9560-bf6c-de11-be94-00142212c4b1
1823 lines
71 KiB
XML
<?xml version="1.0" encoding="iso-8859-1"?>
|
|
<!-- $Revision: 1.9 $ -->
|
|
<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
|
|
<refentry id="pcre.pattern.syntax">
|
|
<refnamediv>
|
|
<refname>Pattern Syntax</refname>
|
|
<refpurpose>Describes PCRE regex syntax</refpurpose>
|
|
</refnamediv>
|
|
|
|
<refsect1>
|
|
<title>Description</title>
|
|
<simpara>
|
|
The PCRE library is a set of functions that implement regular
|
|
expression pattern matching using the same syntax and semantics
|
|
as Perl 5, with just a few differences (see below). The current
|
|
implementation corresponds to Perl 5.005.
|
|
</simpara>
|
|
</refsect1>
|
|
|
|
<refsect1>
|
|
<title>Differences From Perl</title>
|
|
<para>
|
|
The differences described here are with respect to Perl 5.005.
|
|
<orderedlist>
|
|
<listitem>
|
|
<simpara>
|
|
By default, a whitespace character is any character that
|
|
the C library function isspace() recognizes, though it is
|
|
possible to compile PCRE with alternative character type
|
|
tables. Normally isspace() matches space, formfeed, newline,
|
|
carriage return, horizontal tab, and vertical tab. Perl 5 no
|
|
longer includes vertical tab in its set of whitespace characters.
|
|
The \v escape that was in the Perl documentation for
|
|
a long time was never in fact recognized. However, the character
|
|
itself was treated as whitespace at least up to 5.002.
|
|
In 5.004 and 5.005 it does not match \s.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
PCRE does not allow repeat quantifiers on lookahead
|
|
assertions. Perl permits them, but they do not mean what you
|
|
might think. For example, (?!a){3} does not assert that the
|
|
next three characters are not "a". It just asserts that the
|
|
next character is not "a" three times.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Capturing subpatterns that occur inside negative
|
|
lookahead assertions are counted, but their entries in the
|
|
offsets vector are never set. Perl sets its numerical
|
|
variables from any such patterns that are matched before the
|
|
assertion fails to match something (thereby succeeding), but
|
|
only if the negative lookahead assertion contains just one
|
|
branch.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Though binary zero characters are supported in the subject string,
|
|
they are not allowed in a pattern string because it is passed as a
|
|
normal C string, terminated by zero. The escape sequence "\\x00" can
|
|
be used in the pattern to represent a binary zero.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
The following Perl escape sequences are not supported:
|
|
\l, \u, \L, \U, \E, \Q. In fact these are implemented by
|
|
Perl's general string-handling and are not part of its
|
|
pattern matching engine.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
The Perl \G assertion is not supported as it is not
|
|
relevant to single pattern matches.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Fairly obviously, PCRE does not support the (?{code})
|
|
construction.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
There are at the time of writing some oddities in Perl
|
|
5.005_02 concerned with the settings of captured strings
|
|
when part of a pattern is repeated. For example, matching
|
|
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
|
|
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
|
|
unset. However, if the pattern is changed to
|
|
/^(aa(b(b))?)+$/ then $2 (and $3) get set.
|
|
In Perl 5.004 $2 is set in both cases, and that is also true
|
|
of PCRE. If in the future Perl changes to a consistent state
|
|
that is different, PCRE may change to follow.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
Another as yet unresolved discrepancy is that in Perl
|
|
5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
|
|
"a", whereas in PCRE it does not. However, in both Perl and
|
|
PCRE /^(a)?a/ matched against "a" leaves $1 unset.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
PCRE provides some extensions to the Perl regular
|
|
expression facilities:
|
|
<orderedlist>
|
|
<listitem>
|
|
<simpara>
|
|
Although lookbehind assertions must match fixed length
|
|
strings, each alternative branch of a lookbehind assertion
|
|
can match a different length of string. Perl 5.005 requires
|
|
them all to have the same length.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
If <link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> is set and
|
|
<link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> is not
|
|
set, the $ meta-character matches only at the very end of
|
|
the string.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
If <link linkend="pcre.pattern.modifiers">PCRE_EXTRA</link> is set, a backslash followed by a letter
|
|
with no special meaning is faulted.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
If <link linkend="pcre.pattern.modifiers">PCRE_UNGREEDY</link> is set, the greediness of the
|
|
repetition quantifiers is inverted, that is, by default they are
|
|
not greedy, but if followed by a question mark they are.
|
|
</simpara>
|
|
</listitem>
|
|
</orderedlist>
|
|
</para>
|
|
</listitem>
|
|
</orderedlist>
|
|
</para>
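<para>
The inverted greediness described for PCRE_UNGREEDY can be sketched with
<function>preg_match</function>. This is only an illustration: it assumes
the <literal>U</literal> pattern modifier, which is PHP's letter for
PCRE_UNGREEDY, and the subject string and variable names are made up.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Hypothetical sample subject; any string with two tags would do.
$subject = '<b>bold</b>';

// Default: quantifiers are greedy, so .+ runs to the last ">".
preg_match('/<.+>/', $subject, $greedy);

// U modifier (PCRE_UNGREEDY): the same quantifier is now minimal.
preg_match('/<.+>/U', $subject, $lazy);

// With U set, a trailing "?" makes the quantifier greedy again.
preg_match('/<.+?>/U', $subject, $inverted);

echo $greedy[0], "\n";   // <b>bold</b>
echo $lazy[0], "\n";     // <b>
echo $inverted[0], "\n"; // <b>bold</b>
?>
]]>
</programlisting>
</informalexample>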
|
|
</refsect1>
|
|
|
|
<refsect1 id="regexp.reference">
|
|
<title>Regular Expression Details</title>
|
|
<refsect2 id="regexp.introduction">
|
|
<title>Introduction</title>
|
|
<para>
|
|
The syntax and semantics of the regular expressions
|
|
supported by PCRE are described below. Regular expressions are
|
|
also described in the Perl documentation and in a number of
|
|
other books, some of which have copious examples. Jeffrey
|
|
Friedl's "Mastering Regular Expressions", published by
|
|
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
|
|
The description here is intended as reference documentation.
|
|
</para>
|
|
<para>
|
|
A regular expression is a pattern that is matched against a
|
|
subject string from left to right. Most characters stand for
|
|
themselves in a pattern, and match the corresponding
|
|
characters in the subject. As a trivial example, the pattern
|
|
<literal>The quick brown fox</literal>
|
|
matches a portion of a subject string that is identical to
|
|
itself.
|
|
</para>
|
|
</refsect2>
|
|
<refsect2 id="regexp.reference.meta">
|
|
<title>Meta-characters</title>
|
|
<para>
|
|
The power of regular expressions comes from the
|
|
ability to include alternatives and repetitions in the
|
|
pattern. These are encoded in the pattern by the use of
|
|
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
|
|
are interpreted in some special way.
|
|
</para>
|
|
<para>
|
|
There are two different sets of meta-characters: those that
|
|
are recognized anywhere in the pattern except within square
|
|
brackets, and those that are recognized in square brackets.
|
|
Outside square brackets, the meta-characters are as follows:
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
general escape character with several uses
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>^</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
assert start of subject (or line, in multiline mode)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>$</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
assert end of subject (or line, in multiline mode)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>.</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
match any character except newline (by default)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>[</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
start character class definition
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>]</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
end character class definition
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>|</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
start of alternative branch
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>(</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
start subpattern
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>)</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
end subpattern
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>?</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>*</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
0 or more quantifier
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>+</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
1 or more quantifier
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>{</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
start min/max quantifier
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>}</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
end min/max quantifier
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
Part of a pattern that is in square brackets is called a
|
|
"character class". In a character class the only
|
|
meta-characters are:
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
general escape character
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>^</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
negate the class, but only if the first character
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>-</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
indicates character range
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>]</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
terminates the character class
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
The following sections describe the use of each of the
|
|
meta-characters.
|
|
</para>
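<para>
The two sets of meta-characters can be compared with a few calls to
<function>preg_match</function>. The patterns and subjects here are
illustrative only; the point is that the same character is read
differently inside and outside square brackets.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Outside a class, "." is a meta-character and matches any character.
$dotOutside = preg_match('/^a.c$/', 'abc');    // 1

// Inside a class, "." stands for itself.
$dotInside  = preg_match('/^a[.]c$/', 'abc');  // 0
$dotLiteral = preg_match('/^a[.]c$/', 'a.c');  // 1

// Inside a class, "^" negates only when it comes first.
$negated     = preg_match('/^[^ab]$/', 'a');   // 0
$caretMember = preg_match('/^[a^]$/', '^');    // 1
?>
]]>
</programlisting>
</informalexample>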
|
|
</refsect2>
|
|
<refsect2 id="regexp.reference.backslash">
|
|
<title>Backslash</title>
|
|
<para>
|
|
The backslash character has several uses. Firstly, if it is
|
|
followed by a non-alphanumeric character, it takes away any
|
|
special meaning that character may have. This use of
|
|
backslash as an escape character applies both inside and
|
|
outside character classes.
|
|
</para>
|
|
<para>
|
|
For example, if you want to match a "*" character, you write
|
|
"\*" in the pattern. This applies whether or not the
|
|
following character would otherwise be interpreted as a
|
|
meta-character, so it is always safe to precede a non-alphanumeric
|
|
with "\" to specify that it stands for itself. In
|
|
particular, if you want to match a backslash, you write "\\".
|
|
</para>
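<para>
A short sketch of escaping in practice follows. It assumes PHP
single-quoted strings, in which a backslash has to be doubled to reach
the pattern, and it uses <function>preg_quote</function> to add the
escapes automatically. The sample strings are invented.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// "\*" matches a literal asterisk instead of acting as a quantifier.
$star = preg_match('/2\*3/', '2*3=6');               // 1

// To match one backslash the pattern needs "\\"; inside a PHP
// single-quoted string each of those backslashes is doubled again.
$backslash = preg_match('/C:\\\\temp/', 'C:\\temp'); // 1

// preg_quote() puts a backslash before characters that are
// special in a pattern.
$quoted = preg_quote('a.b*c');
echo $quoted, "\n"; // a\.b\*c
?>
]]>
</programlisting>
</informalexample>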
|
|
<para>
|
|
If a pattern is compiled with the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option,
|
|
whitespace in the pattern (other than in a character class) and
|
|
characters between a "#" outside a character class and the
|
|
next newline character are ignored. An escaping backslash
|
|
can be used to include a whitespace or "#" character as part
|
|
of the pattern.
|
|
</para>
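<para>
The following sketch shows a pattern laid out over several lines. It
assumes the <literal>x</literal> pattern modifier, which is PHP's letter
for PCRE_EXTENDED; the pattern itself is a made-up example. Note that
the comments avoid the delimiter character, because PHP looks for the
closing delimiter before PCRE strips the comments.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// With the x modifier, unescaped whitespace and "#" comments are ignored.
$pattern = '/^
    \d{3}   # three digits
    \       # an escaped space, so it is part of the pattern
    \d{4}   # four digits
$/x';

$withSpace    = preg_match($pattern, '555 1234'); // 1
$withoutSpace = preg_match($pattern, '5551234');  // 0

// An escaped "#" is matched literally rather than starting a comment.
$hash = preg_match('/^ \# \d+ $/x', '#42');       // 1
?>
]]>
</programlisting>
</informalexample>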
|
|
<para>
|
|
A second use of backslash provides a way of encoding
|
|
non-printing characters in patterns in a visible manner. There
|
|
is no restriction on the appearance of non-printing characters,
|
|
apart from the binary zero that terminates a pattern,
|
|
but when a pattern is being prepared by text editing, it is
|
|
usually easier to use one of the following escape sequences
|
|
than the binary character it represents:
|
|
</para>
|
|
<para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\a</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
alarm, that is, the BEL character (hex 07)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\cx</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
"control-x", where x is any character
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\e</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
escape (hex 1B)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\f</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
formfeed (hex 0C)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\n</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
newline (hex 0A)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\r</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
carriage return (hex 0D)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\t</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
tab (hex 09)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\xhh</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
character with hex code hh
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\ddd</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
character with octal code ddd, or backreference
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
The precise effect of "<literal>\cx</literal>" is as follows:
|
|
if "<literal>x</literal>" is a lower case letter, it is converted
|
|
to upper case. Then bit 6 of the character (hex 40) is inverted.
|
|
Thus "<literal>\cz</literal>" becomes hex 1A, but
|
|
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
|
|
becomes hex 7B.
|
|
</para>
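<para>
The arithmetic can be reproduced in a few lines of PHP. The function
<literal>control_char</literal> is a hypothetical helper written for
this illustration, not part of PHP or PCRE; it applies the two steps
described above and is then compared with what the pattern escape
matches.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Hypothetical helper mirroring the rule above: upper-case the
// character, then invert bit 6 (hex 40).
function control_char($x)
{
    return chr(ord(strtoupper($x)) ^ 0x40);
}

printf("%02X\n", ord(control_char('z'))); // 1A
printf("%02X\n", ord(control_char('{'))); // 3B
printf("%02X\n", ord(control_char(';'))); // 7B

// The pattern escape agrees with the helper.
$matches = preg_match('/^\cz$/', "\x1A"); // 1
?>
]]>
</programlisting>
</informalexample>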
|
|
<para>
|
|
After "<literal>\x</literal>", up to two hexadecimal digits are
|
|
read (letters can be in upper or lower case).
|
|
</para>
|
|
<para>
|
|
After "<literal>\0</literal>" up to two further octal digits are read.
|
|
In both cases, if there are fewer than two digits, just those that
|
|
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
|
|
specifies two binary zeros followed by a BEL character. Make sure you
|
|
supply two digits after the initial zero if the character
|
|
that follows is itself an octal digit.
|
|
</para>
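<para>
A sketch of these rules, assuming the patterns are written in PHP
single-quoted strings so that the escapes reach PCRE unchanged, while
the subjects use double-quoted strings so that PHP itself builds the
bytes. The variable names are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// "\x41" is the character with hex code 41, the letter "A".
$hex = preg_match('/^\x41$/', 'A');              // 1

// "\0", then "\x" with no digits, then "\07":
// two binary zeros followed by a BEL character.
$subject = "\0\0\x07";
$zeros = preg_match('/^\0\x\07$/', $subject);    // 1

// Without two digits after the zero, a following octal digit is
// absorbed: "\01" is the single byte hex 01.
$absorbed = preg_match('/^\01$/', "\x01");       // 1

// With both digits supplied, the "1" stands for itself.
$padded = preg_match('/^\0001$/', "\0" . '1');   // 1
?>
]]>
</programlisting>
</informalexample>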
|
|
<para>
|
|
The handling of a backslash followed by a digit other than 0
|
|
is complicated. Outside a character class, PCRE reads it
|
|
and any following digits as a decimal number. If the number
|
|
is less than 10, or if there have been at least that many
|
|
previous capturing left parentheses in the expression, the
|
|
entire sequence is taken as a <emphasis>back reference</emphasis>.
A description
|
|
of how this works is given later, following the discussion
|
|
of parenthesized subpatterns.
|
|
</para>
|
|
<para>
|
|
Inside a character class, or if the decimal number is
|
|
greater than 9 and there have not been that many capturing
|
|
subpatterns, PCRE re-reads up to three octal digits following
|
|
the backslash, and generates a single byte from the
|
|
least significant 8 bits of the value. Any subsequent digits
|
|
stand for themselves. For example:
|
|
</para>
|
|
<para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\040</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is another way of writing a space
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\40</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is the same, provided there are fewer than 40
|
|
previous capturing subpatterns
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\7</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is always a back reference
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\11</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
might be a back reference, or another way of
|
|
writing a tab
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\011</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is always a tab
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\0113</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is a tab followed by the character "3"
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\113</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is the character with octal code 113 (since there
|
|
can be no more than 99 back references)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\377</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is a byte consisting entirely of 1 bits
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\81</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is either a back reference, or a binary zero
|
|
followed by the two characters "8" and "1"
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
Note that octal values of 100 or greater must not be
|
|
introduced by a leading zero, because no more than three octal
|
|
digits are ever read.
|
|
</para>
|
|
<para>
|
|
All the sequences that define a single byte value can be
|
|
used both inside and outside character classes. In addition,
|
|
inside a character class, the sequence "<literal>\b</literal>"
|
|
is interpreted as the backspace character (hex 08). Outside a character
|
|
class it has a different meaning (see below).
|
|
</para>
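<para>
Some of the cases listed above can be checked directly. This is a
minimal sketch with invented subjects; the patterns that rely on octal
interpretation deliberately contain no capturing subpatterns, so the
digits cannot be taken as back references.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// "\1" refers back to whatever the first group captured.
$backref = preg_match('/^(a|b)\1$/', 'bb');   // 1
$differs = preg_match('/^(a|b)\1$/', 'ab');   // 0

// "\040" is always octal: a space.
$space = preg_match('/^a\040b$/', 'a b');     // 1

// "\0113" is a tab (octal 011) followed by the character "3".
$tab3 = preg_match('/^\0113$/', "\t3");       // 1

// "\113" is octal 113, the letter "K"; this pattern has no groups.
$octalK = preg_match('/^\113$/', 'K');        // 1

// Inside a class, "\b" is the backspace character (hex 08).
$backspace = preg_match('/^[\b]$/', "\x08");  // 1
?>
]]>
</programlisting>
</informalexample>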
|
|
<para>
|
|
The third use of backslash is for specifying generic
|
|
character types:
|
|
</para>
|
|
<para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\d</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
any decimal digit
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\D</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
any character that is not a decimal digit
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\s</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
any whitespace character
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\S</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
any character that is not a whitespace character
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\w</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
any "word" character
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\W</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
any "non-word" character
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
Each pair of escape sequences partitions the complete set of
|
|
characters into two disjoint sets. Any given character
|
|
matches one, and only one, of each pair.
|
|
</para>
|
|
<para>
|
|
A "word" character is any letter or digit or the underscore
|
|
character, that is, any character which can be part of a
|
|
Perl "<literal>word</literal>". The definition of letters and digits is
|
|
controlled by PCRE's character tables, and may vary if locale-specific
|
|
matching is taking place (see "Locale support"
|
|
above). For example, in the "fr" (French) locale, some
|
|
character codes greater than 128 are used for accented letters,
|
|
and these are matched by <literal>\w</literal>.
|
|
</para>
|
|
<para>
|
|
These character type sequences can appear both inside and
|
|
outside character classes. They each match one character of
|
|
the appropriate type. If the current matching point is at
|
|
the end of the subject string, all of them fail, since there
|
|
is no character to match.
|
|
</para>
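<para>
The character types are shown here with three common PHP functions,
<function>preg_match_all</function>, <function>preg_replace</function>
and <function>preg_split</function>. The subjects are invented, and the
results assume the default character tables.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// \d matches one decimal digit; with "+" it collects runs of digits.
preg_match_all('/\d+/', 'a1 b22 c333', $digits);
print_r($digits[0]); // 1, 22, 333

// \s matches whitespace, so runs of it can be collapsed to one space.
$collapsed = preg_replace('/\s+/', ' ', "one \t two\n three");
echo $collapsed, "\n"; // one two three

// \w covers letters, digits and underscore; \W is the complement.
$words = preg_split('/\W+/', 'foo_bar, baz-42');
print_r($words); // foo_bar, baz, 42

// At the end of the subject every type fails: nothing is left to match.
$atEnd = preg_match('/a\D/', 'a'); // 0
?>
]]>
</programlisting>
</informalexample>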
|
|
<para>
|
|
The fourth use of backslash is for certain simple
|
|
assertions. An assertion specifies a condition that has to be met
|
|
at a particular point in a match, without consuming any
|
|
characters from the subject string. The use of subpatterns
|
|
for more complicated assertions is described below. The
|
|
backslashed assertions are
|
|
</para>
|
|
<para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\b</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
word boundary
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\B</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
not a word boundary
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\A</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
start of subject (independent of multiline mode)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\Z</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
end of subject or newline at end (independent of
|
|
multiline mode)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\z</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
end of subject (independent of multiline mode)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
These assertions may not appear in character classes (but
|
|
note that "<literal>\b</literal>" has a different meaning, namely the backspace
|
|
character, inside a character class).
|
|
</para>
|
|
<para>
|
|
A word boundary is a position in the subject string where
|
|
the current character and the previous character do not both
|
|
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
|
|
<literal>\w</literal> and the other matches
|
|
<literal>\W</literal>), or the start or end of the string if the first
|
|
or last character matches <literal>\w</literal>, respectively.
|
|
</para>
|
|
<para>
|
|
The <literal>\A</literal>, <literal>\Z</literal>, and
|
|
<literal>\z</literal> assertions differ from the traditional
|
|
circumflex and dollar (described below) in that they only
|
|
ever match at the very start and end of the subject string,
|
|
whatever options are set. They are not affected by the
|
|
<link linkend="pcre.pattern.modifiers">PCRE_NOTBOL</link> or
|
|
<link linkend="pcre.pattern.modifiers">PCRE_NOTEOL</link> options.
|
|
The difference between <literal>\Z</literal> and
|
|
<literal>\z</literal> is that <literal>\Z</literal>
|
|
matches before a newline that is the
|
|
last character of the string as well as at the end of the
|
|
string, whereas <literal>\z</literal> matches only at the end.
|
|
</para>
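<para>
A brief sketch of the backslashed assertions, using invented subjects.
None of these assertions consumes a character; they only test the
position reached.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// \b matches between a word character and a non-word character.
$whole  = preg_match('/\bcat\b/', 'a cat sat');     // 1
$inside = preg_match('/\bcat\b/', 'concatenate');   // 0

// \B is the opposite: "cat" must sit inside a longer word.
$embedded = preg_match('/\Bcat\B/', 'concatenate'); // 1

// \Z also matches before a final newline; \z only at the very end.
$subject = "abc\n";
$upperZ = preg_match('/abc\Z/', $subject); // 1
$lowerZ = preg_match('/abc\z/', $subject); // 0
?>
]]>
</programlisting>
</informalexample>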
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.circudollar">
|
|
<title>Circumflex and dollar</title>
|
|
<para>
|
|
Outside a character class, in the default matching mode, the
|
|
circumflex character is an assertion which is true only if
|
|
the current matching point is at the start of the subject
|
|
string. Inside a character class, circumflex has an entirely
|
|
different meaning (see below).
|
|
</para>
|
|
<para>
|
|
Circumflex need not be the first character of the pattern if
|
|
a number of alternatives are involved, but it should be the
|
|
first thing in each alternative in which it appears if the
|
|
pattern is ever to match that branch. If all possible
|
|
alternatives start with a circumflex, that is, if the pattern is
|
|
constrained to match only at the start of the subject, it is
|
|
said to be an "anchored" pattern. (There are also other
|
|
constructs that can cause a pattern to be anchored.)
|
|
</para>
|
|
<para>
|
|
A dollar character is an assertion which is true only if the
|
|
current matching point is at the end of the subject string,
|
|
or immediately before a newline character that is the last
|
|
character in the string (by default). Dollar need not be the
|
|
last character of the pattern if a number of alternatives
|
|
are involved, but it should be the last item in any branch
|
|
in which it appears. Dollar has no special meaning in a
|
|
character class.
|
|
</para>
|
|
<para>
|
|
The meaning of dollar can be changed so that it matches only
|
|
at the very end of the string, by setting the
|
|
<link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
|
|
option at compile or matching time. This
|
|
does not affect the \Z assertion.
|
|
</para>
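<para>
The difference can be seen with a subject that ends in a newline. The
sketch assumes the <literal>D</literal> pattern modifier, which is PHP's
letter for PCRE_DOLLAR_ENDONLY; the subject is made up.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = "abc\n";

// By default "$" also matches just before a trailing newline.
$default = preg_match('/abc$/', $subject);  // 1

// D modifier (PCRE_DOLLAR_ENDONLY): only the very end of the string counts.
$endOnly = preg_match('/abc$/D', $subject); // 0

// \Z keeps its own meaning regardless of D.
$upperZ = preg_match('/abc\Z/D', $subject); // 1
?>
]]>
</programlisting>
</informalexample>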
|
|
<para>
|
|
The meanings of the circumflex and dollar characters are
|
|
changed if the <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> option is set. When this is
|
|
the case, they match immediately after and immediately
|
|
before an internal "\n" character, respectively, in addition
|
|
to matching at the start and end of the subject string. For
|
|
example, the pattern /^abc$/ matches the subject string
|
|
"def\nabc" in multiline mode, but not otherwise.
|
|
Consequently, patterns that are anchored in single line mode
|
|
because all branches start with "^" are not anchored in
|
|
multiline mode. The <link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> option is ignored if
|
|
<link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> is set.
|
|
</para>
|
|
<para>
|
|
Note that the sequences \A, \Z, and \z can be used to match
|
|
the start and end of the subject in both modes, and if all
|
|
branches of a pattern start with \A it is always anchored,
|
|
whether <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> is set or not.
|
|
</para>
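<para>
The example from the text, together with two related cases, can be run
as follows. The sketch assumes the <literal>m</literal> pattern
modifier, which is PHP's letter for PCRE_MULTILINE.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = "def\nabc";

// Default mode: "^" matches only at the start of the subject.
$single = preg_match('/^abc$/', $subject);    // 0

// m modifier: "^" and "$" also match around the internal newline.
$multi = preg_match('/^abc$/m', $subject);    // 1

// Every line start is a candidate position in multiline mode.
$count = preg_match_all('/^\w+$/m', "one\ntwo\nthree", $lines); // 3

// \A stays tied to the start of the subject even with m.
$anchored = preg_match('/\Aabc/m', $subject); // 0
?>
]]>
</programlisting>
</informalexample>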
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.dot">
|
|
<title>Full stop</title>
|
|
<para>
|
|
Outside a character class, a dot in the pattern matches any
|
|
one character in the subject, including a non-printing
|
|
character, but not (by default) newline. If the <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
|
|
option is set, then dots match newlines as well. The
|
|
handling of dot is entirely independent of the handling of
|
|
circumflex and dollar, the only relationship being that they
|
|
both involve newline characters. Dot has no special meaning
|
|
in a character class.
|
|
</para>
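<para>
A minimal sketch of the dot with and without PCRE_DOTALL, assuming the
<literal>s</literal> pattern modifier, which is PHP's letter for that
option. The subjects are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = "a\nc";

// By default the dot does not match a newline.
$default = preg_match('/a.c/', $subject); // 0

// s modifier (PCRE_DOTALL): the dot matches newline as well.
$dotall = preg_match('/a.c/s', $subject); // 1

// Other non-printing characters are matched either way.
$tab = preg_match('/a.c/', "a\tc");       // 1
?>
]]>
</programlisting>
</informalexample>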
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.squarebrackets">
|
|
<title>Square brackets</title>
|
|
<para>
|
|
An opening square bracket introduces a character class,
|
|
terminated by a closing square bracket. A closing square
|
|
bracket on its own is not special. If a closing square
|
|
bracket is required as a member of the class, it should be
|
|
the first data character in the class (after an initial
|
|
circumflex, if present) or escaped with a backslash.
|
|
</para>
|
|
<para>
|
|
A character class matches a single character in the subject;
|
|
the character must be in the set of characters defined by
|
|
the class, unless the first character in the class is a
|
|
circumflex, in which case the subject character must not be in
|
|
the set defined by the class. If a circumflex is actually
|
|
required as a member of the class, ensure it is not the
|
|
first character, or escape it with a backslash.
|
|
</para>
|
|
<para>
|
|
For example, the character class [aeiou] matches any lower
|
|
case vowel, while [^aeiou] matches any character that is not
|
|
a lower case vowel. Note that a circumflex is just a
|
|
convenient notation for specifying the characters which are in
|
|
the class by enumerating those that are not. It is not an
|
|
assertion: it still consumes a character from the subject
|
|
string, and fails if the current pointer is at the end of
|
|
the string.
|
|
</para>
|
|
<para>
|
|
When caseless matching is set, any letters in a class
|
|
represent both their upper case and lower case versions, so
|
|
for example, a caseless [aeiou] matches "A" as well as "a",
|
|
and a caseless [^aeiou] does not match "A", whereas a
|
|
caseful version would.
|
|
</para>
|
|
<para>
|
|
The newline character is never treated in any special way in
|
|
character classes, whatever the setting of the <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
|
|
or <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> options is. A class such as [^a] will
|
|
always match a newline.
|
|
</para>
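<para>
The class behaviour described so far can be checked with
<function>preg_match</function>. This is a sketch with invented
subjects; caseless matching is requested with the <literal>i</literal>
pattern modifier.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$vowel    = preg_match('/^[aeiou]$/', 'e');  // 1
$notVowel = preg_match('/^[^aeiou]$/', 'x'); // 1

// A negated class still has to consume a character,
// so it fails on an empty subject.
$empty = preg_match('/[^aeiou]/', '');       // 0

// Caseless matching folds case in both the plain and the negated class.
$caselessHit  = preg_match('/^[aeiou]$/i', 'A');  // 1
$caselessMiss = preg_match('/^[^aeiou]$/i', 'A'); // 0
$casefulHit   = preg_match('/^[^aeiou]$/', 'A');  // 1

// A negated class matches a newline like any other character.
$newline = preg_match('/^a[^b]c$/', "a\nc");      // 1

// "]" as a class member: first in the class, or escaped.
$bracketFirst   = preg_match('/^[]x]$/', ']');    // 1
$bracketEscaped = preg_match('/^[x\]]$/', ']');   // 1
?>
]]>
</programlisting>
</informalexample>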
|
|
<para>
|
|
The minus (hyphen) character can be used to specify a range
|
|
of characters in a character class. For example, [d-m]
|
|
matches any letter between d and m, inclusive. If a minus
|
|
character is required in a class, it must be escaped with a
|
|
backslash or appear in a position where it cannot be
|
|
interpreted as indicating a range, typically as the first or last
|
|
character in the class.
|
|
</para>
|
|
<para>
|
|
It is not possible to have the literal character "]" as the
|
|
end character of a range. A pattern such as [W-]46] is
|
|
interpreted as a class of two characters ("W" and "-")
|
|
followed by a literal string "46]", so it would match "W46]" or
|
|
"-46]". However, if the "]" is escaped with a backslash it
|
|
is interpreted as the end of range, so [W-\]46] is
|
|
interpreted as a single class containing a range followed by two
|
|
separate characters. The octal or hexadecimal representation
|
|
of "]" can also be used to end a range.
|
|
</para>
|
|
<para>
|
|
Ranges operate in ASCII collating sequence. They can also be
|
|
used for characters specified numerically, for example
|
|
[\000-\037]. If a range that includes letters is used when
|
|
caseless matching is set, it matches the letters in either
|
|
case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
|
|
matched caselessly, and if character tables for the "fr"
|
|
locale are in use, [\xc8-\xcb] matches accented E characters
|
|
in both cases.
|
|
</para>
|
|
<para>
|
|
The character types \d, \D, \s, \S, \w, and \W may also
|
|
appear in a character class, and add the characters that
|
|
they match to the class. For example, [\dABCDEF] matches any
|
|
hexadecimal digit. A circumflex can conveniently be used
|
|
with the upper case character types to specify a more
|
|
restricted set of characters than the matching lower case type.
|
|
For example, the class [^\W_] matches any letter or digit,
|
|
but not underscore.
|
|
</para>
|
|
<para>
|
|
All non-alphanumeric characters other than \, -, ^ (at the
|
|
start) and the terminating ] are non-special in character
|
|
classes, but it does no harm if they are escaped.
|
|
</para>
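<para>
Ranges and character types inside classes are sketched here. The
patterns are taken from the text above; the subjects are invented.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// [d-m] covers the letters d through m inclusive.
$inRange  = preg_match('/^[d-m]$/', 'h'); // 1
$outRange = preg_match('/^[d-m]$/', 'n'); // 0

// A hyphen placed last cannot form a range, so it is a literal member.
$hyphen = preg_match('/^[a-]$/', '-');    // 1

// [W-]46] is the class [W-] followed by the literal text "46]".
$literalTail = preg_match('/^[W-]46]$/', '-46]');   // 1

// [W-\]46] is one class: the range W to "]" plus "4" and "6".
$rangeToBracket = preg_match('/^[W-\]46]$/', 'Z');  // 1
$extraMember    = preg_match('/^[W-\]46]$/', '6');  // 1

// Character types add their members to the class.
$hexDigit = preg_match('/^[\dABCDEF]+$/', '7F3A');  // 1

// [^\W_] means "word character, but not underscore".
$alnum      = preg_match('/^[^\W_]+$/', 'abc123');  // 1
$underscore = preg_match('/^[^\W_]+$/', 'abc_123'); // 0
?>
]]>
</programlisting>
</informalexample>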
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.verticalbar">
|
|
<title>Vertical bar</title>
|
|
<para>
|
|
Vertical bar characters are used to separate alternative
|
|
patterns. For example, the pattern
|
|
|
|
<literal>gilbert|sullivan</literal>
|
|
|
|
matches either "gilbert" or "sullivan". Any number of alternatives
|
|
may appear, and an empty alternative is permitted
|
|
(matching the empty string). The matching process tries
|
|
each alternative in turn, from left to right, and the first
|
|
one that succeeds is used. If the alternatives are within a
|
|
subpattern (defined below), "succeeds" means matching the
|
|
rest of the main pattern as well as the alternative in the
|
|
subpattern.
|
|
</para>
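<para>
The left-to-right rule is easiest to see when one alternative is a
prefix of another. The following sketch uses invented subjects and
prints the text that each call to <function>preg_match</function>
actually matched.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$found = preg_match('/gilbert|sullivan/', 'arthur sullivan', $m); // 1
echo $m[0], "\n"; // sullivan

// Alternatives are tried left to right; the first that succeeds wins,
// even when a later one would match more text.
preg_match('/a|ab/', 'ab', $m);
$firstWins = $m[0];
echo $firstWins, "\n"; // a

// Inside a subpattern, "succeeds" includes the rest of the pattern:
// "a" alone cannot be followed by "c" here, so "ab" is used.
preg_match('/(a|ab)c/', 'abc', $m);
$group = $m[1];
echo $group, "\n"; // ab

// An empty alternative matches the empty string.
$optional = preg_match('/^file(s|)$/', 'file'); // 1
?>
]]>
</programlisting>
</informalexample>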
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.internal-options">
|
|
<title>Internal option setting</title>
|
|
<para>
|
|
The settings of <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link>,
|
|
<link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>,
|
|
<link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>,
|
|
and <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> can be changed from within the pattern by
|
|
a sequence of Perl option letters enclosed between "(?" and
|
|
")". The option letters are
|
|
|
|
<table>
|
|
<title>Internal option letters</title>
|
|
<tgroup cols="2">
|
|
<tbody>
|
|
<row>
|
|
<entry><literal>i</literal></entry>
|
|
<entry>for <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>m</literal></entry>
|
|
<entry>for <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>s</literal></entry>
|
|
<entry>for <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>x</literal></entry>
|
|
<entry>for <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
</para>
|
|
<para>
|
|
For example, (?im) sets caseless, multiline matching. It is
|
|
also possible to unset these options by preceding the letter
|
|
with a hyphen, and a combined setting and unsetting such as
|
|
(?im-sx), which sets <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> and <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> while
|
|
unsetting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> and <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>, is also permitted.
|
|
If a letter appears both before and after the hyphen, the
|
|
option is unset.
|
|
</para>
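<para>
 A minimal sketch of setting and unsetting options from inside a pattern
 follows. Every setting is placed at the very start of the pattern; the
 subjects and variable names are illustrative assumptions.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
// (?i) at the start has the same effect as the i pattern modifier.
$inline   = preg_match('/(?i)abc/', 'ABC');   // 1
$modifier = preg_match('/abc/i', 'ABC');      // 1

// (?im): caseless and multiline, so ^ also matches just after a newline.
$multiline = preg_match('/(?im)^second/', "first\nSECOND");   // 1

// A letter after the hyphen unsets the option, here overriding /i.
$unset = preg_match('/(?-i)abc/i', 'ABC');    // 0
?>
]]>
 </programlisting>
</informalexample>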
|
|
<para>
|
|
The scope of these option changes depends on where in the
|
|
pattern the setting occurs. For settings that are outside
|
|
any subpattern (defined below), the effect is the same as if
|
|
the options were set or unset at the start of matching. The
|
|
following patterns all behave in exactly the same way:
|
|
</para>
|
|
|
|
<literallayout>
|
|
(?i)abc
|
|
a(?i)bc
|
|
ab(?i)c
|
|
abc(?i)
|
|
</literallayout>
|
|
|
|
<para>
|
|
which in turn is the same as compiling the pattern abc with
|
|
<link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> set.
|
|
In other words, such "top level" settings apply to the whole
|
|
pattern (unless there are other changes inside subpatterns).
|
|
If there is more than one setting of the same option at top level,
|
|
the rightmost setting is used.
|
|
</para>
|
|
<para>
|
|
If an option change occurs inside a subpattern, the effect
|
|
is different. This is a change of behaviour in Perl 5.005.
|
|
An option change inside a subpattern affects only that part
|
|
of the subpattern that follows it, so
|
|
|
|
<literal>(a(?i)b)c</literal>
|
|
|
|
matches abc and aBc and no other strings (assuming
|
|
<link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> is not used). By this means, options can be
|
|
made to have different settings in different parts of the
|
|
pattern. Any changes made in one alternative do carry on
|
|
into subsequent branches within the same subpattern. For
|
|
example,
|
|
|
|
<literal>(a(?i)b|c)</literal>
|
|
|
|
matches "ab", "aB", "c", and "C", even though when matching
|
|
"C" the first branch is abandoned before the option setting.
|
|
This is because the effects of option settings happen at
|
|
compile time. There would be some very weird behaviour otherwise.
|
|
</para>
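<para>
 The two subpattern examples above can be checked with a small sketch. The
 anchors and the list of test subjects are assumptions added so that each
 subject has to match in full.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$matched = [];
foreach (['abc', 'aBc', 'Abc', 'abC'] as $subject) {
    // (?i) only affects the "b", the rest of the subpattern it is in.
    if (preg_match('/^(a(?i)b)c$/', $subject)) {
        $matched[] = $subject;
    }
}
// $matched is ['abc', 'aBc']

// The setting made in the first branch carries on into the second one.
$upperC = preg_match('/^(a(?i)b|c)$/', 'C');   // 1
?>
]]>
 </programlisting>
</informalexample>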
|
|
<para>
|
|
The PCRE-specific options <link linkend="pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
|
|
<link linkend="pcre.pattern.modifiers">PCRE_EXTRA</link> can
|
|
be changed in the same way as the Perl-compatible options by
|
|
using the characters U and X respectively. The (?X) flag
|
|
setting is special in that it must always occur earlier in
|
|
the pattern than any of the additional features it turns on,
|
|
even when it is at top level. It is best put at the start.
|
|
</para>
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.subpatterns">
|
|
<title>Subpatterns</title>
|
|
<para>
|
|
Subpatterns are delimited by parentheses (round brackets),
|
|
which can be nested. Marking part of a pattern as a subpattern
|
|
does two things:
|
|
</para>
|
|
<para>
|
|
1. It localizes a set of alternatives. For example, the
|
|
pattern
|
|
|
|
<literal>cat(aract|erpillar|)</literal>
|
|
|
|
matches one of the words "cat", "cataract", or "caterpillar".
|
|
Without the parentheses, it would match "cataract",
|
|
"erpillar" or the empty string.
|
|
</para>
|
|
<para>
|
|
2. It sets up the subpattern as a capturing subpattern (as
|
|
defined above). When the whole pattern matches, that portion
|
|
of the subject string that matched the subpattern is
|
|
passed back to the caller via the <emphasis>ovector</emphasis>
|
|
argument of
|
|
<function>pcre_exec</function>. Opening parentheses are counted
|
|
from left to right (starting from 1) to obtain the numbers of the
|
|
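capturing subpatterns. In PHP, these substrings are returned in the matches array filled in by <function>preg_match</function>.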
|
|
</para>
|
|
<para>
|
|
For example, if the string "the red king" is matched against
|
|
the pattern
|
|
|
|
<literal>the ((red|white) (king|queen))</literal>
|
|
|
|
the captured substrings are "red king", "red", and "king",
|
|
and are numbered 1, 2, and 3.
|
|
</para>
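<para>
 In PHP the numbering can be observed through the third argument of
 <function>preg_match</function>. This is only a sketch; the variable name
 <literal>$m</literal> is an arbitrary choice.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
preg_match('/the ((red|white) (king|queen))/', 'the red king', $m);

// Index 0 holds the whole match; indexes 1..3 follow the order of the
// opening parentheses in the pattern.
// $m[0] is "the red king"
// $m[1] is "red king"
// $m[2] is "red"
// $m[3] is "king"
?>
]]>
 </programlisting>
</informalexample>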
|
|
<para>
|
|
The fact that plain parentheses fulfil two functions is not
|
|
always helpful. There are often times when a grouping subpattern
|
|
is required without a capturing requirement. If an
|
|
opening parenthesis is followed by "?:", the subpattern does
|
|
not do any capturing, and is not counted when computing the
|
|
number of any subsequent capturing subpatterns. For example,
|
|
if the string "the white queen" is matched against the
|
|
pattern
|
|
|
|
<literal>the ((?:red|white) (king|queen))</literal>
|
|
|
|
the captured substrings are "white queen" and "queen", and
|
|
are numbered 1 and 2. The maximum number of captured substrings
|
|
is 99, and the maximum number of all subpatterns,
|
|
both capturing and non-capturing, is 200.
|
|
</para>
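<para>
 The effect of <literal>?:</literal> on the numbering can be sketched in the
 same way (the variable name is again an assumption):
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
preg_match('/the ((?:red|white) (king|queen))/', 'the white queen', $m);

// The (?:red|white) group is skipped when numbering, so only two
// captured substrings follow the whole match.
// $m[1] is "white queen"
// $m[2] is "queen"
?>
]]>
 </programlisting>
</informalexample>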
|
|
<para>
|
|
As a convenient shorthand, if any option settings are
|
|
required at the start of a non-capturing subpattern, the
|
|
option letters may appear between the "?" and the ":". Thus
|
|
the two patterns
|
|
</para>
|
|
|
|
<literallayout>
|
|
(?i:saturday|sunday)
|
|
(?:(?i)saturday|sunday)
|
|
</literallayout>
|
|
|
|
<para>
|
|
match exactly the same set of strings. Because alternative
|
|
branches are tried from left to right, and options are not
|
|
reset until the end of the subpattern is reached, an option
|
|
setting in one branch does affect subsequent branches, so
|
|
the above patterns match "SUNDAY" as well as "Saturday".
|
|
</para>
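<para>
 A short sketch comparing the two spellings with a plain non-capturing group.
 The anchors and the third pattern are assumptions added for contrast.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$shorthand = preg_match('/^(?i:saturday|sunday)$/', 'SUNDAY');      // 1
$longhand  = preg_match('/^(?:(?i)saturday|sunday)$/', 'SUNDAY');   // 1

// Without the option the group is matched with case sensitivity.
$caseful   = preg_match('/^(?:saturday|sunday)$/', 'SUNDAY');       // 0
?>
]]>
 </programlisting>
</informalexample>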
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.repetition">
|
|
<title>Repetition</title>
|
|
<para>
|
|
Repetition is specified by quantifiers, which can follow any
|
|
of the following items:
|
|
|
|
<itemizedlist>
|
|
<listitem><simpara>a single character, possibly escaped</simpara></listitem>
|
|
<listitem><simpara>the . metacharacter</simpara></listitem>
|
|
<listitem><simpara>a character class</simpara></listitem>
|
|
<listitem><simpara>a back reference (see next section)</simpara></listitem>
|
|
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
|
|
see below)</simpara></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
The general repetition quantifier specifies a minimum and
|
|
maximum number of permitted matches, by giving the two
|
|
numbers in curly brackets (braces), separated by a comma.
|
|
The numbers must be less than 65536, and the first must be
|
|
less than or equal to the second. For example:
|
|
|
|
<literal>z{2,4}</literal>
|
|
|
|
matches "zz", "zzz", or "zzzz". A closing brace on its own
|
|
is not a special character. If the second number is omitted,
|
|
but the comma is present, there is no upper limit; if the
|
|
second number and the comma are both omitted, the quantifier
|
|
specifies an exact number of required matches. Thus
|
|
|
|
<literal>[aeiou]{3,}</literal>
|
|
|
|
matches at least 3 successive vowels, but may match many
|
|
more, while
|
|
|
|
<literal>\d{8}</literal>
|
|
|
|
matches exactly 8 digits. An opening curly bracket that
|
|
appears in a position where a quantifier is not allowed, or
|
|
one that does not match the syntax of a quantifier, is taken
|
|
as a literal character. For example, {,6} is not a quantifier,
|
|
but a literal string of four characters.
|
|
</para>
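<para>
 The three quantifier forms above can be tried out as follows. This is a
 sketch under the assumption that each subject must match in full, hence the
 added anchors; the subjects themselves are made up.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$withinRange = preg_match('/^z{2,4}$/', 'zzz');          // 1
$tooMany     = preg_match('/^z{2,4}$/', 'zzzzz');        // 0

// {3,} has no upper limit.
$fiveVowels  = preg_match('/^[aeiou]{3,}$/', 'aeiou');   // 1

// {8} requires exactly eight repetitions.
$eightDigits = preg_match('/^\d{8}$/', '20030101');      // 1
$sevenDigits = preg_match('/^\d{8}$/', '2003010');       // 0
?>
]]>
 </programlisting>
</informalexample>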
|
|
<para>
|
|
The quantifier {0} is permitted, causing the expression to
|
|
behave as if the previous item and the quantifier were not
|
|
present.
|
|
</para>
|
|
<para>
|
|
For convenience (and historical compatibility) the three
|
|
most common quantifiers have single-character abbreviations:
|
|
|
|
<table>
|
|
<title>Single-character quantifiers</title>
|
|
<tgroup cols="2">
|
|
<tbody>
|
|
<row>
|
|
<entry><literal>*</literal></entry>
|
|
<entry>equivalent to <literal>{0,}</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>+</literal></entry>
|
|
<entry>equivalent to <literal>{1,}</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>?</literal></entry>
|
|
<entry>equivalent to <literal>{0,1}</literal></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
</para>
|
|
<para>
|
|
It is possible to construct infinite loops by following a
|
|
subpattern that can match no characters with a quantifier
|
|
that has no upper limit, for example:
|
|
|
|
<literal>(a?)*</literal>
|
|
</para>
|
|
<para>
|
|
Earlier versions of Perl and PCRE used to give an error at
|
|
compile time for such patterns. However, because there are
|
|
cases where this can be useful, such patterns are now
|
|
accepted, but if any repetition of the subpattern does in
|
|
fact match no characters, the loop is forcibly broken.
|
|
</para>
|
|
<para>
|
|
By default, the quantifiers are "greedy", that is, they
|
|
match as much as possible (up to the maximum number of permitted
|
|
times), without causing the rest of the pattern to
|
|
fail. The classic example of where this gives problems is in
|
|
trying to match comments in C programs. These appear between
|
|
the sequences /* and */ and within the sequence, individual
|
|
* and / characters may appear. An attempt to match C comments
|
|
by applying the pattern
|
|
|
|
<literal>/\*.*\*/</literal>
|
|
|
|
to the string
|
|
|
|
<literal>/* first comment */ not comment /* second comment */</literal>
|
|
|
|
fails, because it matches the entire string due to the
|
|
greediness of the .* item.
|
|
</para>
|
|
<para>
|
|
However, if a quantifier is followed by a question mark,
|
|
then it ceases to be greedy, and instead matches the minimum
|
|
number of times possible, so the pattern
|
|
|
|
<literal>/\*.*?\*/</literal>
|
|
|
|
does the right thing with the C comments. The meaning of the
|
|
various quantifiers is not otherwise changed, just the preferred
|
|
number of matches. Do not confuse this use of
|
|
question mark with its use as a quantifier in its own right.
|
|
Because it has two uses, it can sometimes appear doubled, as
|
|
in
|
|
|
|
<literal>\d??\d</literal>
|
|
|
|
which matches one digit by preference, but can match two if
|
|
that is the only way the rest of the pattern matches.
|
|
</para>
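<para>
 The difference between greedy and non-greedy matching can be sketched like
 this. The <literal>~</literal> delimiter is an assumption made so that the
 slashes in the pattern need no escaping, and the variable names are
 illustrative.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$subject = '/* first comment */ not comment /* second comment */';

// Greedy: .* runs to the last */ in the subject.
preg_match('~/\*.*\*/~', $subject, $greedy);
// $greedy[0] is the entire subject

// Non-greedy: .*? stops at the first */ that lets the pattern match.
preg_match('~/\*.*?\*/~', $subject, $lazy);
// $lazy[0] is "/* first comment */"

// \d??\d prefers one digit, but takes two when the anchors require it.
preg_match('/\d??\d/', '12', $one);      // $one[0] is "1"
preg_match('/^\d??\d$/', '12', $two);    // $two[0] is "12"
?>
]]>
 </programlisting>
</informalexample>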
|
|
<para>
|
|
If the <link linkend="pcre.pattern.modifiers">PCRE_UNGREEDY</link> option is set (an option which is not
|
|
available in Perl) then the quantifiers are not greedy by
|
|
default, but individual ones can be made greedy by following
|
|
them with a question mark. In other words, it inverts the
|
|
default behaviour.
|
|
</para>
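<para>
 In PHP this option corresponds to the <literal>U</literal> pattern modifier.
 The following sketch (delimiter, subject and variable names are assumptions)
 shows the inverted defaults:
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$subject = '/* first comment */ not comment /* second comment */';

// With U, a plain .* is no longer greedy.
preg_match('~/\*.*\*/~U', $subject, $ungreedy);
// $ungreedy[0] is "/* first comment */"

// With U, the question mark makes the quantifier greedy again.
preg_match('~/\*.*?\*/~U', $subject, $greedyAgain);
// $greedyAgain[0] is the entire subject
?>
]]>
 </programlisting>
</informalexample>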
|
|
<para>
|
|
When a parenthesized subpattern is quantified with a minimum
|
|
repeat count that is greater than 1 or with a limited maximum,
|
|
more memory is required for the compiled pattern, in
|
|
proportion to the size of the minimum or maximum.
|
|
</para>
|
|
<para>
|
|
If a pattern starts with .* or .{0,} and the <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
|
|
option (equivalent to Perl's /s) is set, thus allowing the .
|
|
to match newlines, then the pattern is implicitly anchored,
|
|
because whatever follows will be tried against every character
|
|
position in the subject string, so there is no point in
|
|
retrying the overall match at any position after the first.
|
|
PCRE treats such a pattern as though it were preceded by \A.
|
|
In cases where it is known that the subject string contains
|
|
no newlines, it is worth setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> when the pattern begins with .* in order to
|
|
obtain this optimization, or
|
|
alternatively using ^ to indicate anchoring explicitly.
|
|
</para>
|
|
<para>
|
|
When a capturing subpattern is repeated, the value captured
|
|
is the substring that matched the final iteration. For example, after
|
|
|
|
<literal>(tweedle[dume]{3}\s*)+</literal>
|
|
|
|
has matched "tweedledum tweedledee" the value of the captured
|
|
substring is "tweedledee". However, if there are
|
|
nested capturing subpatterns, the corresponding captured
|
|
values may have been set in previous iterations. For example,
|
|
after
|
|
|
|
<literal>/(a|(b))+/</literal>
|
|
|
|
matches "aba" the value of the second captured substring is
|
|
"b".
|
|
</para>
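<para>
 Both cases can be observed through the matches array of
 <function>preg_match</function>. This is a sketch only; the variable names
 are arbitrary.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
preg_match('/(tweedle[dume]{3}\s*)+/', 'tweedledum tweedledee', $m);
// $m[1] is "tweedledee", the text matched by the final iteration

preg_match('/(a|(b))+/', 'aba', $n);
// $n[1] is "a" (final iteration of the outer subpattern)
// $n[2] is "b" (set in the second iteration and kept afterwards)
?>
]]>
 </programlisting>
</informalexample>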
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.back-references">
|
|
<title>Back references</title>
|
|
<para>
|
|
Outside a character class, a backslash followed by a digit
|
|
greater than 0 (and possibly further digits) is a back
|
|
reference to a capturing subpattern earlier (i.e. to its
|
|
left) in the pattern, provided there have been that many
|
|
previous capturing left parentheses.
|
|
</para>
|
|
<para>
|
|
However, if the decimal number following the backslash is
|
|
less than 10, it is always taken as a back reference, and
|
|
causes an error only if there are not that many capturing
|
|
left parentheses in the entire pattern. In other words, the
|
|
parentheses that are referenced need not be to the left of
|
|
the reference for numbers less than 10. See the section
|
|
entitled "Backslash" above for further details of the handling
|
|
of digits following a backslash.
|
|
</para>
|
|
<para>
|
|
A back reference matches whatever actually matched the capturing
|
|
subpattern in the current subject string, rather than
|
|
anything matching the subpattern itself. So the pattern
|
|
|
|
<literal>(sens|respons)e and \1ibility</literal>
|
|
|
|
matches "sense and sensibility" and "response and responsibility",
|
|
but not "sense and responsibility". If caseful
|
|
matching is in force at the time of the back reference, then
|
|
the case of letters is relevant. For example,
|
|
|
|
<literal>((?i)rah)\s+\1</literal>
|
|
|
|
matches "rah rah" and "RAH RAH", but not "RAH rah", even
|
|
though the original capturing subpattern is matched caselessly.
|
|
</para>
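<para>
 A hedged sketch of the two back reference examples; the anchors and variable
 names are assumptions added so that each subject is matched as a whole.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$pattern = '/^(sens|respons)e and \1ibility$/';
$same  = preg_match($pattern, 'sense and sensibility');         // 1
$other = preg_match($pattern, 'response and responsibility');   // 1
$mixed = preg_match($pattern, 'sense and responsibility');      // 0

// The subpattern is caseless, but \1 is compared case-sensitively.
$rah = '/^((?i)rah)\s+\1$/';
$lower     = preg_match($rah, 'rah rah');   // 1
$upper     = preg_match($rah, 'RAH RAH');   // 1
$mixedCase = preg_match($rah, 'RAH rah');   // 0
?>
]]>
 </programlisting>
</informalexample>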
|
|
<para>
|
|
There may be more than one back reference to the same subpattern.
|
|
If a subpattern has not actually been used in a
|
|
particular match, then any back references to it always
|
|
fail. For example, the pattern
|
|
|
|
<literal>(a|(bc))\2</literal>
|
|
|
|
always fails if it starts to match "a" rather than "bc".
|
|
Because there may be up to 99 back references, all digits
|
|
following the backslash are taken as part of a potential
|
|
back reference number. If the pattern continues with a digit
|
|
character, then some delimiter must be used to terminate the
|
|
back reference. If the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option is set, this can
|
|
be whitespace. Otherwise an empty comment can be used.
|
|
</para>
|
|
<para>
|
|
A back reference that occurs inside the parentheses to which
|
|
it refers fails when the subpattern is first used, so, for
|
|
example, (a\1) never matches. However, such references can
|
|
be useful inside repeated subpatterns. For example, the pattern
|
|
|
|
<literal>(a|b\1)+</literal>
|
|
|
|
matches any number of "a"s and also "aba", "ababbaa" etc. At
|
|
each iteration of the subpattern, the back reference matches
|
|
the character string corresponding to the previous iteration.
|
|
In order for this to work, the pattern must be such
|
|
that the first iteration does not need to match the back
|
|
reference. This can be done using alternation, as in the
|
|
example above, or by a quantifier with a minimum of zero.
|
|
</para>
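<para>
 The behaviour of a back reference inside its own repeated subpattern can be
 sketched as follows. The anchors and the subjects are assumptions; each
 iteration must be either "a" or "b" followed by the text of the previous
 iteration.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$pattern = '/^(a|b\1)+$/';

$onlyAs = preg_match($pattern, 'aaa');       // 1: a, a, a
$aba    = preg_match($pattern, 'aba');       // 1: a, then b + "a"
$longer = preg_match($pattern, 'ababbaa');   // 1: a, ba, bba, a
$broken = preg_match($pattern, 'ab');        // 0: "b" lacks the "a" that \1 needs
?>
]]>
 </programlisting>
</informalexample>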
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.assertions">
|
|
<title>Assertions</title>
|
|
<para>
|
|
An assertion is a test on the characters following or
|
|
preceding the current matching point that does not actually
|
|
consume any characters. The simple assertions coded as \b,
|
|
\B, \A, \Z, \z, ^ and $ are described above. More complicated
|
|
assertions are coded as subpatterns. There are two
|
|
kinds: those that look ahead of the current position in the
|
|
subject string, and those that look behind it.
|
|
</para>
|
|
<para>
|
|
An assertion subpattern is matched in the normal way, except
|
|
that it does not cause the current matching position to be
|
|
changed. Lookahead assertions start with (?= for positive
|
|
assertions and (?! for negative assertions. For example,
|
|
|
|
<literal>\w+(?=;)</literal>
|
|
|
|
matches a word followed by a semicolon, but does not include
|
|
the semicolon in the match, and
|
|
|
|
<literal>foo(?!bar)</literal>
|
|
|
|
matches any occurrence of "foo" that is not followed by
|
|
"bar". Note that the apparently similar pattern
|
|
|
|
<literal>(?!foo)bar</literal>
|
|
|
|
does not find an occurrence of "bar" that is preceded by
|
|
something other than "foo"; it finds any occurrence of "bar"
|
|
whatsoever, because the assertion (?!foo) is always &true;
|
|
when the next three characters are "bar". A lookbehind
|
|
assertion is needed to achieve this effect.
|
|
</para>
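<para>
 The three lookahead patterns above can be tried with a small sketch. The
 subject strings and variable names are assumptions made for illustration.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
// The semicolon is required but is not part of the match.
preg_match('/\w+(?=;)/', 'return value;', $word);
// $word[0] is "value"

// Only the second "foo" is not followed by "bar".
preg_match('/foo(?!bar)/', 'foobar foobaz', $foo, PREG_OFFSET_CAPTURE);
// $foo[0][1] is 7, the offset of the second "foo"

// (?!foo) is tested where "bar" starts, so it always succeeds there.
$found = preg_match('/(?!foo)bar/', 'foobar');   // 1
?>
]]>
 </programlisting>
</informalexample>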
|
|
<para>
|
|
Lookbehind assertions start with (?&lt;= for positive assertions
|
|
and (?&lt;! for negative assertions. For example,
|
|
|
|
<literal>(?&lt;!foo)bar</literal>
|
|
|
|
does find an occurrence of "bar" that is not preceded by
|
|
"foo". The contents of a lookbehind assertion are restricted
|
|
such that all the strings it matches must have a fixed
|
|
length. However, if there are several alternatives, they do
|
|
not all have to have the same fixed length. Thus
|
|
|
|
<literal>(?&lt;=bullock|donkey)</literal>
|
|
|
|
is permitted, but
|
|
|
|
<literal>(?&lt;!dogs?|cats?)</literal>
|
|
|
|
causes an error at compile time. Branches that match different
|
|
length strings are permitted only at the top level of
|
|
a lookbehind assertion. This is an extension compared with
|
|
Perl 5.005, which requires all branches to match the same
|
|
length of string. An assertion such as
|
|
|
|
<literal>(?&lt;=ab(c|de))</literal>
|
|
|
|
is not permitted, because its single top-level branch can
|
|
match two different lengths, but it is acceptable if rewritten
|
|
to use two top-level branches:
|
|
|
|
<literal>(?&lt;=abc|abde)</literal>
|
|
|
|
The implementation of lookbehind assertions is, for each
|
|
alternative, to temporarily move the current position back
|
|
by the fixed width and then try to match. If there are
|
|
insufficient characters before the current position, the
|
|
match is deemed to fail. Lookbehinds in conjunction with
|
|
once-only subpatterns can be particularly useful for matching
|
|
at the ends of strings; an example is given at the end
|
|
of the section on once-only subpatterns.
|
|
</para>
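<para>
 A minimal sketch of positive and negative lookbehind. The subjects and the
 trailing <literal>cart</literal> in the second pattern are assumptions added
 so that there is something to match after the assertion.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$afterFoo = preg_match('/(?<!foo)bar/', 'foobar');   // 0
$afterBaz = preg_match('/(?<!foo)bar/', 'bazbar');   // 1

// The two alternatives have different, but individually fixed, lengths.
$donkey  = preg_match('/(?<=bullock|donkey)cart/', 'donkeycart');    // 1
$bullock = preg_match('/(?<=bullock|donkey)cart/', 'bullockcart');   // 1
$hand    = preg_match('/(?<=bullock|donkey)cart/', 'handcart');      // 0
?>
]]>
 </programlisting>
</informalexample>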
|
|
<para>
|
|
Several assertions (of any sort) may occur in succession.
|
|
For example,
|
|
|
|
<literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>
|
|
|
|
matches "foo" preceded by three digits that are not "999".
|
|
Notice that each of the assertions is applied independently
|
|
at the same point in the subject string. First there is a
|
|
check that the previous three characters are all digits,
|
|
then there is a check that the same three characters are not
|
|
"999". This pattern does not match "foo" preceded by six
|
|
characters, the first three of which are digits and the last three
|
|
of which are not "999". For example, it doesn't match
|
|
"123abcfoo". A pattern to do that is
|
|
|
|
<literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
|
|
</para>
|
|
<para>
|
|
This time the first assertion looks at the preceding six
|
|
characters, checking that the first three are digits, and
|
|
then the second assertion checks that the preceding three
|
|
characters are not "999".
|
|
</para>
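<para>
 The two patterns can be compared side by side in a short sketch; the
 subjects and variable names are assumptions chosen to exercise each check.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
// Both assertions look at the same three characters before "foo".
$threeBack = '/(?<=\d{3})(?<!999)foo/';
$digits  = preg_match($threeBack, '123foo');      // 1
$nines   = preg_match($threeBack, '999foo');      // 0
$letters = preg_match($threeBack, '123abcfoo');   // 0

// Here the first assertion looks back six characters instead.
$sixBack = '/(?<=\d{3}...)(?<!999)foo/';
$sixOk    = preg_match($sixBack, '123abcfoo');    // 1
$sixNines = preg_match($sixBack, '123999foo');    // 0
?>
]]>
 </programlisting>
</informalexample>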
|
|
<para>
|
|
Assertions can be nested in any combination. For example,
|
|
|
|
<literal>(?&lt;=(?&lt;!foo)bar)baz</literal>
|
|
|
|
matches an occurrence of "baz" that is preceded by "bar"
|
|
which in turn is not preceded by "foo", while
|
|
|
|
<literal>(?&lt;=\d{3}(?!999)...)foo</literal>
|
|
|
|
is another pattern which matches "foo" preceded by three
|
|
digits and any three characters that are not "999".
|
|
</para>
|
|
<para>
|
|
Assertion subpatterns are not capturing subpatterns, and may
|
|
not be repeated, because it makes no sense to assert the
|
|
same thing several times. If any kind of assertion contains
|
|
capturing subpatterns within it, these are counted for the
|
|
purposes of numbering the capturing subpatterns in the whole
|
|
pattern. However, substring capturing is carried out only
|
|
for positive assertions, because it does not make sense for
|
|
negative assertions.
|
|
</para>
|
|
<para>
|
|
Assertions count towards the maximum of 200 parenthesized
|
|
subpatterns.
|
|
</para>
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.onlyonce">
|
|
<title>Once-only subpatterns</title>
|
|
<para>
|
|
With both maximizing and minimizing repetition, failure of
|
|
what follows normally causes the repeated item to be
|
|
re-evaluated to see if a different number of repeats allows the
|
|
rest of the pattern to match. Sometimes it is useful to
|
|
prevent this, either to change the nature of the match, or
|
|
to cause it to fail earlier than it otherwise might, when the
|
|
author of the pattern knows there is no point in carrying
|
|
on.
|
|
</para>
|
|
<para>
|
|
Consider, for example, the pattern \d+foo when applied to
|
|
the subject line
|
|
|
|
<literal>123456bar</literal>
|
|
</para>
|
|
<para>
|
|
After matching all 6 digits and then failing to match "foo",
|
|
the normal action of the matcher is to try again with only 5
|
|
digits matching the \d+ item, and then with 4, and so on,
|
|
before ultimately failing. Once-only subpatterns provide the
|
|
means for specifying that once a portion of the pattern has
|
|
matched, it is not to be re-evaluated in this way, so the
|
|
matcher would give up immediately on failing to match "foo"
|
|
the first time. The notation is another kind of special
|
|
parenthesis, starting with (?> as in this example:
|
|
|
|
<literal>(?>\d+)foo</literal>
|
|
</para>
|
|
<para>
|
|
This kind of parenthesis "locks up" the part of the pattern
|
|
it contains once it has matched, and a failure further into
|
|
the pattern is prevented from backtracking into it.
|
|
Backtracking past it to previous items, however, works as normal.
|
|
</para>
|
|
<para>
|
|
An alternative description is that a subpattern of this type
|
|
matches the string of characters that an identical standalone
|
|
pattern would match, if anchored at the current point
|
|
in the subject string.
|
|
</para>
|
|
<para>
|
|
Once-only subpatterns are not capturing subpatterns. Simple
|
|
cases such as the above example can be thought of as a maximizing
|
|
repeat that must swallow everything it can. So,
|
|
while both \d+ and \d+? are prepared to adjust the number of
|
|
digits they match in order to make the rest of the pattern
|
|
match, (?>\d+) can only match an entire sequence of digits.
|
|
</para>
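<para>
 The "locking" effect can be made visible with a sketch in which the item
 after the repeat needs one of the digits back. The patterns ending in
 <literal>5</literal> and the subjects are assumptions made for the
 demonstration.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
// \d+ first takes "12345", then gives back the "5" so the match succeeds.
$backtracks = preg_match('/\d+5/', '12345');        // 1

// (?>\d+) keeps every digit it matched, so the literal 5 can never match.
$onceOnly   = preg_match('/(?>\d+)5/', '12345');    // 0

// When no digit has to be given back, both forms behave the same.
$plainFoo = preg_match('/\d+foo/', '123456foo');       // 1
$lockFoo  = preg_match('/(?>\d+)foo/', '123456foo');   // 1
?>
]]>
 </programlisting>
</informalexample>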
|
|
<para>
|
|
This construction can of course contain arbitrarily complicated
|
|
subpatterns, and it can be nested.
|
|
</para>
|
|
<para>
|
|
Once-only subpatterns can be used in conjunction with
|
|
look-behind assertions to specify efficient matching at the end
|
|
of the subject string. Consider a simple pattern such as
|
|
|
|
<literal>abcd$</literal>
|
|
|
|
when applied to a long string which does not match. Because
|
|
matching proceeds from left to right, PCRE will look for
|
|
each "a" in the subject and then see if what follows matches
|
|
the rest of the pattern. If the pattern is specified as
|
|
|
|
<literal>^.*abcd$</literal>
|
|
|
|
then the initial .* matches the entire string at first, but
|
|
when this fails (because there is no following "a"), it
|
|
backtracks to match all but the last character, then all but
|
|
the last two characters, and so on. Once again the search
|
|
for "a" covers the entire string, from right to left, so we
|
|
are no better off. However, if the pattern is written as
|
|
|
|
<literal>^(?&gt;.*)(?&lt;=abcd)</literal>
|
|
|
|
then there can be no backtracking for the .* item; it can
|
|
match only the entire string. The subsequent lookbehind
|
|
assertion does a single test on the last four characters. If
|
|
it fails, the match fails immediately. For long strings,
|
|
this approach makes a significant difference to the processing time.
|
|
</para>
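<para>
 A sketch of the end-of-string test described above. It only checks the
 result, not the timing, and the subjects are assumptions (single-line
 strings, since the dot does not match a newline by default).
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$endsWithAbcd = '/^(?>.*)(?<=abcd)/';

// .* takes the whole line once; the lookbehind then inspects its last
// four characters and no further positions are tried.
$atEnd   = preg_match($endsWithAbcd, 'xyzabcd');   // 1
$atStart = preg_match($endsWithAbcd, 'abcdxyz');   // 0
?>
]]>
 </programlisting>
</informalexample>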
|
|
<para>
|
|
When a pattern contains an unlimited repeat inside a subpattern
|
|
that can itself be repeated an unlimited number of
|
|
times, the use of a once-only subpattern is the only way to
|
|
avoid some failing matches taking a very long time indeed.
|
|
The pattern
|
|
|
|
<literal>(\D+|&lt;\d+&gt;)*[!?]</literal>
|
|
|
|
matches an unlimited number of substrings that either consist
|
|
of non-digits, or digits enclosed in &lt;&gt;, followed by
|
|
either ! or ?. When it matches, it runs quickly. However, if
|
|
it is applied to
|
|
|
|
<literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>
|
|
|
|
it takes a long time before reporting failure. This is
|
|
because the string can be divided between the two repeats in
|
|
a large number of ways, and all have to be tried. (The example
|
|
used [!?] rather than a single character at the end,
|
|
because both PCRE and Perl have an optimization that allows
|
|
for fast failure when a single character is used. They
|
|
remember the last single character that is required for a
|
|
match, and fail early if it is not present in the string.)
|
|
If the pattern is changed to
|
|
|
|
<literal>((?&gt;\D+)|&lt;\d+&gt;)*[!?]</literal>
|
|
|
|
sequences of non-digits cannot be broken, and failure happens quickly.
|
|
</para>
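<para>
 The once-only form can be exercised with a sketch like the one below. It
 deliberately runs only the protected pattern; the subject of 52 "a"
 characters is built with <function>str_repeat</function>, and the second
 subject is an assumption added to show a successful match.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$pattern = '/((?>\D+)|<\d+>)*[!?]/';

// No "!" or "?" anywhere: each run of non-digits is taken as one block,
// so failure is reported without trying every way of splitting it.
$noMatch = preg_match($pattern, str_repeat('a', 52));   // 0

// Digits in angle brackets followed by "?" still match.
$match = preg_match($pattern, '<123>?');                // 1
?>
]]>
 </programlisting>
</informalexample>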
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.conditional">
|
|
<title>Conditional subpatterns</title>
|
|
<para>
|
|
It is possible to cause the matching process to obey a subpattern
|
|
conditionally or to choose between two alternative
|
|
subpatterns, depending on the result of an assertion, or
|
|
whether a previous capturing subpattern matched or not. The
|
|
two possible forms of conditional subpattern are
|
|
</para>
|
|
|
|
<literallayout>
|
|
(?(condition)yes-pattern)
|
|
(?(condition)yes-pattern|no-pattern)
|
|
</literallayout>
|
|
<para>
|
|
If the condition is satisfied, the yes-pattern is used; otherwise
|
|
the no-pattern (if present) is used. If there are
|
|
more than two alternatives in the subpattern, a compile-time
|
|
error occurs.
|
|
</para>
|
|
<para>
|
|
There are two kinds of condition. If the text between the
|
|
parentheses consists of a sequence of digits, then the
|
|
condition is satisfied if the capturing subpattern of that
|
|
number has previously matched. Consider the following pattern,
|
|
which contains non-significant white space to make it
|
|
more readable (assume the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option) and to
|
|
divide it into three parts for ease of discussion:
|
|
|
|
<literal>( \( )? [^()]+ (?(1) \) )</literal>
|
|
</para>
|
|
<para>
|
|
The first part matches an optional opening parenthesis, and
|
|
if that character is present, sets it as the first captured
|
|
substring. The second part matches one or more characters
|
|
that are not parentheses. The third part is a conditional
|
|
subpattern that tests whether the first set of parentheses
|
|
matched or not. If they did, that is, if the subject started
|
|
with an opening parenthesis, the condition is &true;, and so
|
|
the yes-pattern is executed and a closing parenthesis is
|
|
required. Otherwise, since no-pattern is not present, the
|
|
subpattern matches nothing. In other words, this pattern
|
|
matches a sequence of non-parentheses, optionally enclosed
|
|
in parentheses.
|
|
</para>
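<para>
 A hedged sketch of this conditional pattern using the <literal>x</literal>
 modifier. The <literal>^</literal> and <literal>$</literal> anchors are an
 assumption added so that unbalanced subjects are rejected as a whole.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

$enclosed  = preg_match($pattern, '(abc)');   // 1
$bare      = preg_match($pattern, 'abc');     // 1

// An opening parenthesis was captured, so a closing one is required.
$openOnly  = preg_match($pattern, '(abc');    // 0

// Nothing was captured, so the conditional matches nothing and the
// trailing ")" is left over.
$closeOnly = preg_match($pattern, 'abc)');    // 0
?>
]]>
 </programlisting>
</informalexample>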
|
|
<para>
|
|
If the condition is not a sequence of digits, it must be an
|
|
assertion. This may be a positive or negative lookahead or
|
|
lookbehind assertion. Consider this pattern, again containing
|
|
non-significant white space, and with the two alternatives on
|
|
the second line:
|
|
</para>
|
|
|
|
<literallayout>
|
|
(?(?=[^a-z]*[a-z])
|
|
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
|
</literallayout>
|
|
<para>
|
|
The condition is a positive lookahead assertion that matches
|
|
an optional sequence of non-letters followed by a letter. In
|
|
other words, it tests for the presence of at least one
|
|
letter in the subject. If a letter is found, the subject is
|
|
matched against the first alternative; otherwise it is
|
|
matched against the second. This pattern matches strings in
|
|
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
|
|
letters and dd are digits.
|
|
</para>
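<para>
 The assertion-based condition can be sketched in the same way. The anchors
 and the subjects are assumptions; the pattern is spread over two lines,
 which the <literal>x</literal> modifier permits.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
$pattern = '/^(?(?=[^a-z]*[a-z])
        \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )$/x';

$withLetters = preg_match($pattern, '12-abc-34');   // 1
$digitsOnly  = preg_match($pattern, '12-34-56');    // 1

// A letter is present, so only the first alternative is tried,
// and it needs exactly three letters.
$twoLetters  = preg_match($pattern, '12-ab-34');    // 0
?>
]]>
 </programlisting>
</informalexample>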
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.comments">
|
|
<title>Comments</title>
|
|
<para>
|
|
The sequence (?# marks the start of a comment which
|
|
continues up to the next closing parenthesis. Nested
|
|
parentheses are not permitted. The characters that make up a
|
|
comment play no part in the pattern matching at all.
|
|
</para>
|
|
<para>
|
|
If the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option is set, an unescaped # character
|
|
outside a character class introduces a comment that
|
|
continues up to the next newline character in the pattern.
|
|
</para>
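<para>
 Both comment styles can be sketched as follows. The date-like pattern and
 the subject are assumptions made for the example.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php
// (?# ... ) comments work with or without the x modifier.
$inline = preg_match('/\d{4}(?# four-digit year)-\d{2}(?# month)/', '2003-01');   // 1

// With the x modifier, an unescaped # starts a comment that runs to
// the end of the line, and whitespace in the pattern is ignored.
$extended = preg_match('/
    \d{4}   # year
    -
    \d{2}   # month
/x', '2003-01');   // 1
?>
]]>
 </programlisting>
</informalexample>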
|
|
</refsect2>
|
|
|
|
<refsect2 id="regexp.reference.recursive">
|
|
<title>Recursive patterns</title>
|
|
<para>
|
|
Consider the problem of matching a string in parentheses,
|
|
allowing for unlimited nested parentheses. Without the use
|
|
of recursion, the best that can be done is to use a pattern
|
|
that matches up to some fixed depth of nesting. It is not
|
|
possible to handle an arbitrary nesting depth. Perl 5.6 has
|
|
provided an experimental facility that allows regular
|
|
expressions to recurse (among other things). The special
|
|
item (?R) is provided for the specific case of recursion.
|
|
This PCRE pattern solves the parentheses problem (assume
|
|
the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>
|
|
option is set so that white space is
|
|
ignored):
|
|
|
|
<literal>\( ( (?>[^()]+) | (?R) )* \)</literal>
|
|
</para>
|
|
<para>
|
|
First it matches an opening parenthesis. Then it matches any
|
|
number of substrings which can either be a sequence of
|
|
non-parentheses, or a recursive match of the pattern itself
|
|
(i.e. a correctly parenthesized substring). Finally there is
|
|
a closing parenthesis.
|
|
</para>
|
|
<para>
|
|
This particular example pattern contains nested unlimited
|
|
repeats, and so the use of a once-only subpattern for matching
|
|
strings of non-parentheses is important when applying
|
|
the pattern to strings that do not match. For example, when
|
|
it is applied to
|
|
|
|
<literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>
|
|
|
|
it yields "no match" quickly. However, if a once-only subpattern
|
|
is not used, the match runs for a very long time
|
|
indeed because there are so many different ways the + and *
|
|
repeats can carve up the subject, and all have to be tested
|
|
before failure can be reported.
|
|
</para>
   <para>
    The values set for any capturing subpatterns are those from
    the outermost level of the recursion at which the subpattern
    value is set. If the pattern above is matched against

    <literal>(ab(cd)ef)</literal>

    the value for the capturing parentheses is "ef", which is
    the last value taken on at the top level. If additional
    parentheses are added, giving

    <literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>
    then the string they capture
    is "ab(cd)ef", the contents of the top level parentheses. If
    there are more than 15 capturing parentheses in a pattern,
    PCRE has to obtain extra memory to store data during a
    recursion, which it does by using pcre_malloc, freeing it
    via pcre_free afterwards. If no memory can be obtained, it
    saves data for the first 15 capturing parentheses only, as
    there is no way to give an out-of-memory error from within a
    recursion.
   </para>
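   <para>
    As a small illustration of the capture behaviour described
    above, the two patterns can be applied to the same subject and
    the captured substrings inspected. This is only a sketch; the
    variable names are arbitrary, and the x modifier is assumed to
    stand in for PCRE_EXTENDED.
   </para>
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
$subject = '(ab(cd)ef)';

// One capturing group: it keeps the last value set at the top level.
preg_match('/\( ( (?>[^()]+) | (?R) )* \)/x', $subject, $single);

// An extra outer group captures everything between the outer parentheses.
preg_match('/\( ( ( (?>[^()]+) | (?R) )* ) \)/x', $subject, $nested);

var_dump($single[1]); // string(2) "ef"
var_dump($nested[1]); // string(8) "ab(cd)ef"
?>
]]>
    </programlisting>
   </informalexample>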
  </refsect2>

  <refsect2 id="regexp.reference.performances">
   <title>Performance</title>
   <para>
    Certain items that may appear in patterns are more efficient
    than others. It is more efficient to use a character class
    like [aeiou] than a set of alternatives such as (a|e|i|o|u).
    In general, the simplest construction that provides the
    required behaviour is the most efficient. Jeffrey
    Friedl's book, Mastering Regular Expressions, contains a lot
    of discussion about optimizing regular expressions for
    efficient performance.
   </para>
   <para>
    When a pattern begins with .* and the <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> option is
    set, the pattern is implicitly anchored by PCRE, since it
    can match only at the start of a subject string. However, if
    <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> is not set, PCRE cannot make this optimization,
    because the . metacharacter does not then match a newline,
    and if the subject string contains newlines, the pattern may
    match from the character immediately following one of them
    instead of from the very start. For example, the pattern

    <literal>(.*) second</literal>

    matches the subject "first\nand second" (where \n stands for
    a newline character) with the first captured substring being
    "and". In order to do this, PCRE has to retry the match
    starting after every newline in the subject.
   </para>
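   <para>
    The effect of PCRE_DOTALL on this pattern can be seen with a
    short sketch. It assumes the s pattern modifier, which is how
    PHP exposes PCRE_DOTALL; the variable names are arbitrary.
   </para>
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
$subject = "first\nand second";

// Without PCRE_DOTALL the dot stops at the newline, so the match
// starts just after it.
preg_match('/(.*) second/', $subject, $without);

// With PCRE_DOTALL (modifier s) the dot also matches the newline,
// so the match starts at the very beginning of the subject.
preg_match('/(.*) second/s', $subject, $with);

var_dump($without[1]); // string(3) "and"
var_dump($with[1]);    // string(9) "first\nand" with a real newline
?>
]]>
    </programlisting>
   </informalexample>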
   <para>
    If you are using such a pattern with subject strings that do
    not contain newlines, the best performance is obtained by
    setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>, or starting the pattern with ^.* to
    indicate explicit anchoring. That saves PCRE from having to
    scan along the subject looking for a newline to restart at.
   </para>
   <para>
    Beware of patterns that contain nested indefinite repeats.
    These can take a long time to run when applied to a string
    that does not match. Consider the pattern fragment

    <literal>(a+)*</literal>
   </para>
   <para>
    This can match "aaaa" in 16 different ways, and this number
    increases very rapidly as the string gets longer. (The *
    repeat can match 0, 1, 2, 3, or 4 times, and for each of
    those cases other than 0, the + repeats can match different
    numbers of times.) When the remainder of the pattern is such
    that the entire match is going to fail, PCRE has in principle
    to try every possible variation, and this can take an
    extremely long time.
   </para>
   <para>
    An optimization catches some of the simpler cases such
    as

    <literal>(a+)*b</literal>

    where a literal character follows. Before embarking on the
    standard matching procedure, PCRE checks that there is a "b"
    later in the subject string, and if there is not, it fails
    the match immediately. However, when there is no following
    literal this optimization cannot be used. You can see the
    difference by comparing the behaviour of

    <literal>(a+)*\d</literal>

    with the pattern above. The pattern above, which ends in a
    literal "b", gives a failure almost instantly when applied to
    a whole line of "a" characters, whereas the pattern ending
    in \d takes an appreciable time with strings
    longer than about 20 characters.
   </para>
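   <para>
    The difference can be observed from PHP with a sketch along
    the following lines. It assumes a PHP version that provides
    <function>preg_last_error</function> and the
    pcre.backtrack_limit setting, neither of which is described
    in this section. Depending on that limit and on the PCRE
    build, the slow pattern either reports no match or makes
    <function>preg_match</function> return &false;. The third
    pattern uses a once-only subpattern, as discussed for the
    recursive example, and is an illustrative rewrite rather than
    something this section prescribes.
   </para>
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
$subject = str_repeat('a', 25);

// A literal "b" follows the nested repeat, so PCRE can give up early.
$fast = preg_match('/(a+)*b/', $subject);

// No literal follows: PCRE may have to try a huge number of variations,
// and PHP may abort the attempt when its backtrack limit is reached.
$slow = preg_match('/(a+)*\d/', $subject);

// A once-only subpattern stops the inner repeat from being re-divided.
$atomic = preg_match('/(?>a+)*\d/', $subject);

var_dump($fast);   // int(0)
var_dump($atomic); // int(0)

if ($slow === false) {
    // The match was abandoned; the error code says why.
    echo 'Match abandoned, error code ', preg_last_error(), "\n";
} else {
    echo "No match, found only after exhaustive backtracking\n";
}
?>
]]>
    </programlisting>
   </informalexample>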
  </refsect2>
 </refsect1>
</refentry>

<!-- Keep this comment at the end of the file
Local variables:
mode: sgml
sgml-omittag:t
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
indent-tabs-mode:nil
sgml-parent-document:nil
sgml-default-dtd-file:"../../../../manual.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:nil
sgml-local-ecat-files:nil
End:
vim600: syn=xml fen fdm=syntax fdl=2 si
vim: et tw=78 syn=sgml
vi: ts=1 sw=1
-->