mirror of
https://github.com/sigmasternchen/php-doc-en
synced 2025-03-19 10:28:54 +00:00

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@328365 c90b9560-bf6c-de11-be94-00142212c4b1
2310 lines
88 KiB
XML
<?xml version="1.0" encoding="utf-8"?>
|
|
<!-- $Revision$ -->
|
|
<!-- split from ./en/functions/pcre.xml, last change in rev 1.2 -->
|
|
<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook">
|
|
<title>Pattern Syntax</title>
|
|
<titleabbrev>PCRE regex syntax</titleabbrev>
|
|
|
|
<section xml:id="regexp.introduction">
|
|
<title>Introduction</title>
|
|
<para>
|
|
The syntax and semantics of the regular expressions
supported by PCRE are described below. Regular expressions are
also described in the Perl documentation and in a number of
books, some of which have copious examples. Jeffrey
Friedl's "Mastering Regular Expressions", published by
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
The description here is intended as reference documentation.
|
|
</para>
|
|
<para>
|
|
A regular expression is a pattern that is matched against a
|
|
subject string from left to right. Most characters stand for
|
|
themselves in a pattern, and match the corresponding
|
|
characters in the subject. As a trivial example, the pattern
|
|
<literal>The quick brown fox</literal>
|
|
matches a portion of a subject string that is identical to
|
|
itself.
|
|
</para>
|
|
</section>
|
|
<section xml:id="regexp.reference.delimiters">
|
|
<title>Delimiters</title>
|
|
<para>
|
|
When using the PCRE functions, the pattern must be enclosed
by <emphasis>delimiters</emphasis>. A delimiter can be any non-alphanumeric,
non-backslash, non-whitespace character.
|
|
</para>
|
|
<para>
|
|
Commonly used delimiters are forward slashes (<literal>/</literal>), hash
signs (<literal>#</literal>) and tildes (<literal>~</literal>). The
following are all examples of valid delimited patterns.
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
/foo bar/
#^[^0-9]$#
+php+
%[a-zA-Z0-9_-]%
]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
</para>
|
|
<para>
|
|
If the delimiter needs to be matched inside the pattern it must be
|
|
escaped using a backslash. If the delimiter appears often inside the
|
|
pattern, it is a good idea to choose another delimiter in order to increase
|
|
readability.
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
/http:\/\//
#http://#
]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
The <function>preg_quote</function> function may be used to escape a string
before inserting it into a pattern; its optional second parameter
specifies the delimiter to be escaped as well.
|
|
</para>
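<para>
The following is a minimal sketch of the two approaches described above,
assuming a stock PHP installation with the PCRE extension. The variable
names and the sample strings are illustrative only.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Both patterns match the literal text "http://"; only the delimiter differs.
$url = 'http://example.com/';
$withSlashes = preg_match('/http:\/\//', $url); // delimiter escaped in the pattern
$withHashes  = preg_match('#http://#', $url);   // no escaping needed

// preg_quote() escapes regex metacharacters; its second argument
// additionally escapes the chosen delimiter.
$needle  = 'a/b.c';                 // hypothetical user input
$quoted  = preg_quote($needle, '/');
$pattern = '/^' . $quoted . '$/';

var_dump($withSlashes, $withHashes);      // int(1) int(1)
var_dump($quoted);                        // string(7) "a\/b\.c"
var_dump(preg_match($pattern, 'a/b.c'));  // int(1)
var_dump(preg_match($pattern, 'a/bxc'));  // int(0): the dot is literal
]]>
</programlisting>
</informalexample>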
|
|
<para>
|
|
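In addition to the aforementioned delimiters, it is also possible to use
bracket-style delimiters, where the opening and closing brackets are the
starting and ending delimiter, respectively. <literal>()</literal>,
<literal>{}</literal>, <literal>[]</literal> and <literal>&lt;&gt;</literal>
are all valid bracket-style delimiter pairs.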
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
{this is a pattern}
]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
</para>
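<para>
As a short sketch of bracket-style delimiters in use (the subject string
and variable names are made up for illustration):
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = 'this is a pattern';

// Curly braces as delimiters; the parentheses inside are an ordinary subpattern.
$braces = preg_match('{^this is a (\w+)$}', $subject, $m);

// Parentheses as delimiters; the pattern itself is just "is a".
$parens = preg_match('(is a)', $subject);

var_dump($braces);  // int(1)
var_dump($m[1]);    // string(7) "pattern"
var_dump($parens);  // int(1)
]]>
</programlisting>
</informalexample>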
|
|
<para>
|
|
You may add <link linkend="reference.pcre.pattern.modifiers">pattern
|
|
modifiers</link> after the ending delimiter. The following is an example
|
|
of case-insensitive matching:
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
#[a-z]#i
]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
</para>
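<para>
The effect of the modifier can be seen by running the same pattern with
and without it. This is only a sketch; the subject string is arbitrary.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = 'PHP';

// Without the modifier, [a-z] matches lower case letters only.
$caseSensitive = preg_match('#[a-z]#', $subject);

// With the "i" modifier after the ending delimiter, case is ignored.
$caseInsensitive = preg_match('#[a-z]#i', $subject);

var_dump($caseSensitive);    // int(0)
var_dump($caseInsensitive);  // int(1)
]]>
</programlisting>
</informalexample>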
|
|
</section>
|
|
<section xml:id="regexp.reference.meta">
|
|
<title>Meta-characters</title>
|
|
<para>
|
|
The power of regular expressions comes from the
|
|
ability to include alternatives and repetitions in the
|
|
pattern. These are encoded in the pattern by the use of
|
|
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
|
|
are interpreted in some special way.
|
|
</para>
|
|
<para>
|
|
There are two different sets of meta-characters: those that
|
|
are recognized anywhere in the pattern except within square
|
|
brackets, and those that are recognized in square brackets.
|
|
Outside square brackets, the meta-characters are as follows:
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\</emphasis></term>
|
|
<listitem><simpara>general escape character with several uses</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>^</emphasis></term>
|
|
<listitem><simpara>assert start of subject (or line, in multiline mode)</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>$</emphasis></term>
|
|
<listitem><simpara>assert end of subject or before a terminating newline (or end of line, in multiline mode)</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>.</emphasis></term>
|
|
<listitem><simpara>match any character except newline (by default)</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>[</emphasis></term>
|
|
<listitem><simpara>start character class definition</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>]</emphasis></term>
|
|
<listitem><simpara>end character class definition</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>|</emphasis></term>
|
|
<listitem><simpara>start of alternative branch</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>(</emphasis></term>
|
|
<listitem><simpara>start subpattern</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>)</emphasis></term>
|
|
<listitem><simpara>end subpattern</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>?</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
extends the meaning of (, also 0 or 1 quantifier, also makes greedy
|
|
quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>*</emphasis></term>
|
|
<listitem><simpara>0 or more quantifier</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>+</emphasis></term>
|
|
<listitem><simpara>1 or more quantifier</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>{</emphasis></term>
|
|
<listitem><simpara>start min/max quantifier</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>}</emphasis></term>
|
|
<listitem><simpara>end min/max quantifier</simpara></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
Part of a pattern that is in square brackets is called a
|
|
"character class". In a character class the only
|
|
meta-characters are:
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\</emphasis></term>
|
|
<listitem><simpara>general escape character</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>^</emphasis></term>
|
|
<listitem><simpara>negate the class, but only if the first character</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>-</emphasis></term>
|
|
<listitem><simpara>indicates character range</simpara></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
The following sections describe the use of each of the
|
|
meta-characters.
|
|
</para>
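<para>
The difference between the two sets of meta-characters can be sketched as
follows. The patterns and variable names are illustrative, not taken from
the PCRE documentation.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Outside a character class, "." is a meta-character (any character except newline).
$dotOutside = preg_match('/^a.c$/', 'abc');

// Inside a character class, "." stands for itself.
$dotInsideNoMatch = preg_match('/^a[.]c$/', 'abc');
$dotInsideMatch   = preg_match('/^a[.]c$/', 'a.c');

// "^" negates a class only when it is the first character of the class.
$negated      = preg_match('/^[^0-9]$/', '^'); // "^" is not a digit
$caretLiteral = preg_match('/^[0-9^]$/', '^'); // "^" is a member of the class
$caretOther   = preg_match('/^[0-9^]$/', 'a'); // "a" is neither a digit nor "^"

var_dump($dotOutside, $dotInsideNoMatch, $dotInsideMatch); // int(1) int(0) int(1)
var_dump($negated, $caretLiteral, $caretOther);            // int(1) int(1) int(0)
]]>
</programlisting>
</informalexample>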
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.escape">
|
|
<title>Escape sequences</title>
|
|
<para>
|
|
The backslash character has several uses. Firstly, if it is
|
|
followed by a non-alphanumeric character, it takes away any
|
|
special meaning that character may have. This use of
|
|
backslash as an escape character applies both inside and
|
|
outside character classes.
|
|
</para>
|
|
<para>
|
|
For example, if you want to match a "*" character, you write
|
|
"\*" in the pattern. This applies whether or not the
|
|
following character would otherwise be interpreted as a
|
|
meta-character, so it is always safe to precede a non-alphanumeric
|
|
with "\" to specify that it stands for itself. In
|
|
particular, if you want to match a backslash, you write "\\".
|
|
</para>
|
|
<note>
|
|
<para>
|
|
Single- and double-quoted PHP <link
linkend="language.types.string.syntax">strings</link> give the backslash
a special meaning. Thus if \ has to be matched with the regular
expression \\, then "\\\\" or '\\\\' must be used in PHP code.
|
|
</para>
|
|
</note>
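<para>
A small sketch of the double escaping described in the note, assuming
PHP 5.4 or later (where the third argument of
<function>preg_match_all</function> is optional). The sample path is made up.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// In a single-quoted string "\d" is not an escape, so this holds one backslash.
$path = 'C:\dir';

// Four backslashes in PHP source become two in the string, that is the
// regular expression \\ which matches a single backslash.
$pattern = '/\\\\/';

$count = preg_match_all($pattern, $path);

var_dump(strlen($path));             // int(6)
var_dump($pattern === "/\\\\/");     // bool(true): same string in double quotes
var_dump($count);                    // int(1)
]]>
</programlisting>
</informalexample>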
|
|
<para>
|
|
If a pattern is compiled with the
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option,
|
|
whitespace in the pattern (other than in a character class) and
|
|
characters between a "#" outside a character class and the next newline
|
|
character are ignored. An escaping backslash can be used to include a
|
|
whitespace or "#" character as part of the pattern.
|
|
</para>
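<para>
In PHP the PCRE_EXTENDED option is selected with the <literal>x</literal>
pattern modifier. The following sketch shows whitespace and comments being
ignored, and a backslash keeping a space and a "#" as part of the pattern.
The date-like subject is only an illustration.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Whitespace and "# ..." comments are ignored in extended mode.
$pattern = '/
    (\d{4})  # year
    -        # separator
    (\d{2})  # month
/x';
preg_match($pattern, '2013-11', $m);

// The unescaped space is ignored, so the pattern is effectively "ab".
$ignoredSpace = preg_match('/a b/x', 'ab');
$spaceNeeded  = preg_match('/a b/x', 'a b');

// Escaped whitespace and an escaped "#" are matched literally.
$escaped = preg_match('/a\ b\#c/x', 'a b#c');

var_dump($m[1], $m[2]);                           // "2013" "11"
var_dump($ignoredSpace, $spaceNeeded, $escaped);  // int(1) int(0) int(1)
]]>
</programlisting>
</informalexample>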
|
|
<para>
|
|
A second use of backslash provides a way of encoding
|
|
non-printing characters in patterns in a visible manner. There
|
|
is no restriction on the appearance of non-printing characters,
|
|
apart from the binary zero that terminates a pattern,
|
|
but when a pattern is being prepared by text editing, it is
|
|
usually easier to use one of the following escape sequences
|
|
than the binary character it represents:
|
|
</para>
|
|
<para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\a</emphasis></term>
|
|
<listitem>
|
|
<simpara>alarm, that is, the BEL character (hex 07)</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\cx</emphasis></term>
|
|
<listitem>
|
|
<simpara>"control-x", where x is any character</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\e</emphasis></term>
|
|
<listitem>
|
|
<simpara>escape (hex 1B)</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\f</emphasis></term>
|
|
<listitem>
|
|
<simpara>formfeed (hex 0C)</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\n</emphasis></term>
|
|
<listitem>
|
|
<simpara>newline (hex 0A)</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\p{xx}</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
a character with the xx property, see
|
|
<link linkend="regexp.reference.unicode">unicode properties</link>
|
|
for more info
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\P{xx}</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
a character without the xx property, see
|
|
<link linkend="regexp.reference.unicode">unicode properties</link>
|
|
for more info
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\r</emphasis></term>
|
|
<listitem>
|
|
<simpara>carriage return (hex 0D)</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\t</emphasis></term>
|
|
<listitem>
|
|
<simpara>tab (hex 09)</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\xhh</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
character with hex code hh
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\ddd</emphasis></term>
|
|
<listitem>
|
|
<simpara>character with octal code ddd, or backreference</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
The precise effect of "<literal>\cx</literal>" is as follows:
|
|
if "<literal>x</literal>" is a lower case letter, it is converted
|
|
to upper case. Then bit 6 of the character (hex 40) is inverted.
|
|
Thus "<literal>\cz</literal>" becomes hex 1A, but
|
|
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
|
|
becomes hex 7B.
|
|
</para>
|
|
<para>
|
|
After "<literal>\x</literal>", up to two hexadecimal digits are
|
|
read (letters can be in upper or lower case).
|
|
In <emphasis>UTF-8 mode</emphasis>, "<literal>\x{...}</literal>" is
|
|
allowed, where the contents of the braces is a string of hexadecimal
|
|
digits. It is interpreted as a UTF-8 character whose code number is the
|
|
given hexadecimal number. The original hexadecimal escape sequence,
|
|
<literal>\xhh</literal>, matches a two-byte UTF-8 character if the value
|
|
is greater than 127.
|
|
</para>
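<para>
The hexadecimal escapes can be tried out as follows. This is a sketch that
assumes PCRE was built with UTF-8 support; in PHP, UTF-8 mode is selected
with the <literal>u</literal> pattern modifier. Subjects are written as raw
bytes so that the example does not depend on the encoding of the source file.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// \x41 is the character with hex code 41, that is "A".
$ascii = preg_match('/^\x41$/', 'A');

// In UTF-8 mode \x{20AC} is the code point U+20AC (euro sign),
// encoded in the subject as the three bytes E2 82 AC.
$euro = preg_match('/^\x{20AC}$/u', "\xE2\x82\xAC");

// In UTF-8 mode \xE9 (greater than 127) matches the two-byte character C3 A9.
$twoByte = preg_match('/^\xE9$/u', "\xC3\xA9");

// Without UTF-8 mode \xE9 is the single byte E9, so the same subject fails.
$singleByte = preg_match('/^\xE9$/', "\xC3\xA9");

var_dump($ascii, $euro, $twoByte, $singleByte); // int(1) int(1) int(1) int(0)
]]>
</programlisting>
</informalexample>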
|
|
<para>
|
|
After "<literal>\0</literal>" up to two further octal digits are read.
|
|
In both cases, if there are fewer than two digits, just those that
|
|
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
|
|
specifies two binary zeros followed by a BEL character. Make sure you
|
|
supply two digits after the initial zero if the character
|
|
that follows is itself an octal digit.
|
|
</para>
|
|
<para>
|
|
The handling of a backslash followed by a digit other than 0
|
|
is complicated. Outside a character class, PCRE reads it
|
|
and any following digits as a decimal number. If the number
|
|
is less than 10, or if there have been at least that many
|
|
previous capturing left parentheses in the expression, the
|
|
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
|
|
of how this works is given later, following the discussion
|
|
of parenthesized subpatterns.
|
|
</para>
|
|
<para>
|
|
Inside a character class, or if the decimal number is
|
|
greater than 9 and there have not been that many capturing
|
|
subpatterns, PCRE re-reads up to three octal digits following
|
|
the backslash, and generates a single byte from the
|
|
least significant 8 bits of the value. Any subsequent digits
|
|
stand for themselves. For example:
|
|
</para>
|
|
<para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\040</emphasis></term>
|
|
<listitem><simpara>is another way of writing a space</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\40</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is the same, provided there are fewer than 40
|
|
previous capturing subpatterns
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\7</emphasis></term>
|
|
<listitem><simpara>is always a back reference</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\11</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
might be a back reference, or another way of
|
|
writing a tab
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\011</emphasis></term>
|
|
<listitem><simpara>is always a tab</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\0113</emphasis></term>
|
|
<listitem><simpara>is a tab followed by the character "3"</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\113</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is the character with octal code 113 (since there
|
|
can be no more than 99 back references)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\377</emphasis></term>
|
|
<listitem><simpara>is a byte consisting entirely of 1 bits</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\81</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
is either a back reference, or a binary zero
|
|
followed by the two characters "8" and "1"
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
Note that octal values of 100 or greater must not be
|
|
introduced by a leading zero, because no more than three octal
|
|
digits are ever read.
|
|
</para>
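<para>
A few of the cases listed above, sketched in PHP. The patterns are kept in
single-quoted strings so that the backslash sequences reach PCRE unchanged;
the subjects are arbitrary.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// \040 is octal 40, a space.
$space = preg_match('/^a\040b$/', 'a b');

// \011 is always a tab; in \0113 the trailing "3" stands for itself.
$tab      = preg_match('/^a\011b$/', "a\tb");
$tabThree = preg_match('/^\0113$/', "\t3");

// \1 is a back reference to the first capturing subpattern.
$sameTwice = preg_match('/^(a|b)\1$/', 'aa');
$different = preg_match('/^(a|b)\1$/', 'ab');

var_dump($space, $tab, $tabThree);  // int(1) int(1) int(1)
var_dump($sameTwice, $different);   // int(1) int(0)
]]>
</programlisting>
</informalexample>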
|
|
<para>
|
|
All the sequences that define a single byte value can be
|
|
used both inside and outside character classes. In addition,
|
|
inside a character class, the sequence "<literal>\b</literal>"
|
|
is interpreted as the backspace character (hex 08). Outside a character
|
|
class it has a different meaning (see below).
|
|
</para>
|
|
<para>
|
|
The third use of backslash is for specifying generic
|
|
character types:
|
|
</para>
|
|
<para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\d</emphasis></term>
|
|
<listitem><simpara>any decimal digit</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\D</emphasis></term>
|
|
<listitem><simpara>any character that is not a decimal digit</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\h</emphasis></term>
|
|
<listitem><simpara>any horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\H</emphasis></term>
|
|
<listitem><simpara>any character that is not a horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\s</emphasis></term>
|
|
<listitem><simpara>any whitespace character</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\S</emphasis></term>
|
|
<listitem><simpara>any character that is not a whitespace character</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\v</emphasis></term>
|
|
<listitem><simpara>any vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\V</emphasis></term>
|
|
<listitem><simpara>any character that is not a vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\w</emphasis></term>
|
|
<listitem><simpara>any "word" character</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\W</emphasis></term>
|
|
<listitem><simpara>any "non-word" character</simpara></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
Each pair of escape sequences partitions the complete set of
|
|
characters into two disjoint sets. Any given character
|
|
matches one, and only one, of each pair.
|
|
</para>
|
|
<para>
|
|
A "word" character is any letter or digit or the underscore
|
|
character, that is, any character which can be part of a
|
|
Perl "<emphasis>word</emphasis>". The definition of letters and digits is
|
|
controlled by PCRE's character tables, and may vary if locale-specific
|
|
matching is taking place. For example, in the "fr" (French) locale, some
|
|
character codes greater than 128 are used for accented letters,
|
|
and these are matched by <literal>\w</literal>.
|
|
</para>
|
|
<para>
|
|
These character type sequences can appear both inside and
|
|
outside character classes. They each match one character of
|
|
the appropriate type. If the current matching point is at
|
|
the end of the subject string, all of them fail, since there
|
|
is no character to match.
|
|
</para>
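<para>
The character type sequences can be explored with a short sketch such as
the one below. The subject string is invented for the example, and
<literal>\h</literal> and <literal>\v</literal> assume PHP 5.2.4 or later.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = "id 42\tok_1";

preg_match_all('/\d+/', $subject, $digits);   // runs of decimal digits
preg_match_all('/\w+/', $subject, $words);    // runs of "word" characters
preg_match_all('/\s/', $subject, $spaces);    // single whitespace characters
preg_match_all('/[\d_]+/', $subject, $mixed); // \d used inside a character class

$hspace = preg_match('/^\h$/', "\t"); // a tab is horizontal whitespace
$vspace = preg_match('/^\v$/', "\n"); // a newline is vertical whitespace

// At the end of the subject there is no character left for \D to match.
$atEnd = preg_match('/ab\D/', 'ab');

var_dump($digits[0]); // "42", "1"
var_dump($words[0]);  // "id", "42", "ok_1"
var_dump($mixed[0]);  // "42", "_1"
var_dump($hspace, $vspace, $atEnd); // int(1) int(1) int(0)
]]>
</programlisting>
</informalexample>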
|
|
<para>
|
|
The fourth use of backslash is for certain simple
|
|
assertions. An assertion specifies a condition that has to be met
|
|
at a particular point in a match, without consuming any
|
|
characters from the subject string. The use of subpatterns
|
|
for more complicated assertions is described below. The
|
|
backslashed assertions are
|
|
</para>
|
|
<para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\b</emphasis></term>
|
|
<listitem><simpara>word boundary</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\B</emphasis></term>
|
|
<listitem><simpara>not a word boundary</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\A</emphasis></term>
|
|
<listitem><simpara>start of subject (independent of multiline mode)</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\Z</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
end of subject or newline at end (independent of
|
|
multiline mode)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\z</emphasis></term>
|
|
<listitem><simpara>end of subject (independent of multiline mode)</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\G</emphasis></term>
|
|
<listitem><simpara>first matching position in subject</simpara></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
<para>
|
|
These assertions may not appear in character classes (but
|
|
note that "<literal>\b</literal>" has a different meaning, namely the backspace
|
|
character, inside a character class).
|
|
</para>
|
|
<para>
|
|
A word boundary is a position in the subject string where
|
|
the current character and the previous character do not both
|
|
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
|
|
<literal>\w</literal> and the other matches
|
|
<literal>\W</literal>), or the start or end of the string if the first
|
|
or last character matches <literal>\w</literal>, respectively.
|
|
</para>
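<para>
The two meanings of <literal>\b</literal> can be compared with a sketch
like the following; the sample words are arbitrary.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Outside a character class, \b asserts a word boundary.
$wholeWord  = preg_match('/\bcat\b/', 'a cat sat');
$insideWord = preg_match('/\bcat\b/', 'concatenate');

// \B asserts the opposite: "cat" surrounded by word characters.
$notBoundary = preg_match('/\Bcat\B/', 'concatenate');

// Inside a character class, \b is the backspace character (hex 08).
$backspace   = preg_match('/[\b]/', "a\x08b");
$noBackspace = preg_match('/[\b]/', 'a b');

var_dump($wholeWord, $insideWord, $notBoundary); // int(1) int(0) int(1)
var_dump($backspace, $noBackspace);              // int(1) int(0)
]]>
</programlisting>
</informalexample>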
|
|
<para>
|
|
The <literal>\A</literal>, <literal>\Z</literal>, and
|
|
<literal>\z</literal> assertions differ from the traditional
|
|
circumflex and dollar (described below) in that they only
|
|
ever match at the very start and end of the subject string,
|
|
whatever options are set. They are not affected by the
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
|
|
options. The difference between <literal>\Z</literal> and
|
|
<literal>\z</literal> is that <literal>\Z</literal> matches before a
|
|
newline that is the last character of the string as well as at the end of
|
|
the string, whereas <literal>\z</literal> matches only at the end.
|
|
</para>
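<para>
The differences between these assertions and the circumflex and dollar can
be sketched as follows. In PHP, PCRE_MULTILINE corresponds to the
<literal>m</literal> modifier and PCRE_DOLLAR_ENDONLY to the
<literal>D</literal> modifier; the subject is made up for the example.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = "one\ntwo\n";

// \Z matches before a newline that is the last character; \z does not.
$upperZ       = preg_match('/two\Z/', $subject);
$lowerZ       = preg_match('/two\z/', $subject);
$lowerZAtEnd  = preg_match('/two\n\z/', $subject);

// In multiline mode ^ matches at the start of any line, \A still does not.
$caretMultiline = preg_match('/^two/m', $subject);
$startMultiline = preg_match('/\Atwo/m', $subject);

// By default $ behaves like \Z; with the D modifier it behaves like \z.
$dollar        = preg_match('/two$/', $subject);
$dollarEndOnly = preg_match('/two$/D', $subject);

var_dump($upperZ, $lowerZ, $lowerZAtEnd);      // int(1) int(0) int(1)
var_dump($caretMultiline, $startMultiline);    // int(1) int(0)
var_dump($dollar, $dollarEndOnly);             // int(1) int(0)
]]>
</programlisting>
</informalexample>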
|
|
<para>
|
|
The <literal>\G</literal> assertion is true only when the current
|
|
matching position is at the start point of the match, as specified by
|
|
the <parameter>offset</parameter> argument of
|
|
<function>preg_match</function>. It differs from <literal>\A</literal>
|
|
when the value of <parameter>offset</parameter> is non-zero.
|
|
</para>
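<para>
A minimal sketch of the difference between <literal>\G</literal> and
<literal>\A</literal> when an offset is passed as the fifth argument of
<function>preg_match</function>. The subject string is illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = 'xxfoo';

// Matching starts at offset 2, and \G is true exactly there.
$gAtOffset = preg_match('/\Gfoo/', $subject, $m, 0, 2);

// \A is only true at the very start of the subject, whatever the offset.
$aAtOffset = preg_match('/\Afoo/', $subject, $m2, 0, 2);

// With offset 0, \G behaves like \A, and "foo" is not at position 0.
$gAtZero = preg_match('/\Gfoo/', $subject);

var_dump($gAtOffset, $aAtOffset, $gAtZero); // int(1) int(0) int(0)
]]>
</programlisting>
</informalexample>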
|
|
|
|
<para>
|
|
<literal>\Q</literal> and <literal>\E</literal> can be used to ignore
regexp metacharacters in the pattern: everything between them is taken
literally. For example,
<literal>\w+\Q.$.\E$</literal> will match one or more word characters,
followed by the literal text <literal>.$.</literal>, anchored at the end of
the string.
|
|
</para>
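<para>
The pattern from the paragraph above can be checked with a short sketch;
the two subject strings are invented for the purpose.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Between \Q and \E the dot and dollar are plain characters;
// the final $ is outside the quoted part and anchors the match.
$pattern = '/\w+\Q.$.\E$/';

$literal = preg_match($pattern, 'abc.$.');
$other   = preg_match($pattern, 'abcx$x'); // "x" is not a literal dot

var_dump($literal, $other); // int(1) int(0)
]]>
</programlisting>
</informalexample>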
|
|
|
|
<para>
|
|
<literal>\K</literal> can be used to reset the match start since
|
|
PHP 5.2.4. For example, the pattern <literal>foo\Kbar</literal> matches
|
|
"foobar", but reports that it has matched "bar". The use of
|
|
<literal>\K</literal> does not interfere with the setting of captured
|
|
substrings. For example, when the pattern <literal>(foo)\Kbar</literal>
|
|
matches "foobar", the first substring is still set to "foo".
|
|
</para>
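<para>
Both <literal>\K</literal> patterns mentioned above, as a runnable sketch
(assuming PHP 5.2.4 or later; the variable names are illustrative):
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// "foo" must be present, but the reported match starts after \K.
preg_match('/foo\Kbar/', 'foobar', $plain);

// Captured substrings are still set as usual.
preg_match('/(foo)\Kbar/', 'foobar', $captured);

var_dump($plain[0]);    // string(3) "bar"
var_dump($captured[0]); // string(3) "bar"
var_dump($captured[1]); // string(3) "foo"
]]>
</programlisting>
</informalexample>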
|
|
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.unicode">
|
|
<title>Unicode character properties</title>
|
|
<para>
|
|
Since PHP 5.1.0, three
additional escape sequences to match generic character types are available
when <emphasis>UTF-8 mode</emphasis> is selected. They are:
|
|
</para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>\p{xx}</emphasis></term>
|
|
<listitem><simpara>a character with the xx property</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\P{xx}</emphasis></term>
|
|
<listitem><simpara>a character without the xx property</simpara></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term><emphasis>\X</emphasis></term>
|
|
<listitem><simpara>an extended Unicode sequence</simpara></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
<para>
|
|
The property names represented by <literal>xx</literal> above are limited
|
|
to the Unicode general category properties. Each character has exactly one
|
|
such property, specified by a two-letter abbreviation. For compatibility with
|
|
Perl, negation can be specified by including a circumflex between the
|
|
opening brace and the property name. For example, <literal>\p{^Lu}</literal>
|
|
is the same as <literal>\P{Lu}</literal>.
|
|
</para>
|
|
<para>
|
|
If only one letter is specified with <literal>\p</literal> or
|
|
<literal>\P</literal>, it includes all the properties that start with that
|
|
letter. In this case, in the absence of negation, the curly brackets in the
|
|
escape sequence are optional; these two examples have the same effect:
|
|
</para>
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
\p{L}
\pL
]]>
|
|
</programlisting>
|
|
</informalexample>
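<para>
As a sketch of these forms in use, assuming PCRE was built with Unicode
property support and UTF-8 mode is selected with the <literal>u</literal>
modifier. The sample word is written as raw UTF-8 bytes so that the result
does not depend on the encoding of the source file.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// U+00C9 (upper case E acute), "t", U+00E9 (lower case e acute) in UTF-8.
$word = "\xC3\x89t\xC3\xA9";

$letters      = preg_match_all('/\p{L}/u', $word);   // any letter
$lettersShort = preg_match_all('/\pL/u', $word);     // same, braces omitted
$upper        = preg_match_all('/\p{Lu}/u', $word);  // upper case letters
$notUpper     = preg_match_all('/\P{Lu}/u', $word);  // everything else
$notUpperPerl = preg_match_all('/\p{^Lu}/u', $word); // Perl-style negation

var_dump($letters, $lettersShort);            // int(3) int(3)
var_dump($upper, $notUpper, $notUpperPerl);   // int(1) int(2) int(2)
]]>
</programlisting>
</informalexample>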
|
|
<table>
|
|
<title>Supported property codes</title>
|
|
<tgroup cols="3">
|
|
<thead>
|
|
<row>
|
|
<entry>Property</entry>
|
|
<entry>Matches</entry>
|
|
<entry>Notes</entry>
|
|
</row>
|
|
</thead>
|
|
<tbody>
|
|
<row>
|
|
<entry><literal>C</literal></entry>
|
|
<entry>Other</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Cc</literal></entry>
|
|
<entry>Control</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Cf</literal></entry>
|
|
<entry>Format</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Cn</literal></entry>
|
|
<entry>Unassigned</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Co</literal></entry>
|
|
<entry>Private use</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row rowsep="1">
|
|
<entry><literal>Cs</literal></entry>
|
|
<entry>Surrogate</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>L</literal></entry>
|
|
<entry>Letter</entry>
|
|
<entry>
|
|
Includes the following properties: <literal>Ll</literal>,
|
|
<literal>Lm</literal>, <literal>Lo</literal>, <literal>Lt</literal> and
|
|
<literal>Lu</literal>.
|
|
</entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Ll</literal></entry>
|
|
<entry>Lower case letter</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Lm</literal></entry>
|
|
<entry>Modifier letter</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Lo</literal></entry>
|
|
<entry>Other letter</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Lt</literal></entry>
|
|
<entry>Title case letter</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row rowsep="1">
|
|
<entry><literal>Lu</literal></entry>
|
|
<entry>Upper case letter</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>M</literal></entry>
|
|
<entry>Mark</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Mc</literal></entry>
|
|
<entry>Spacing mark</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Me</literal></entry>
|
|
<entry>Enclosing mark</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row rowsep="1">
|
|
<entry><literal>Mn</literal></entry>
|
|
<entry>Non-spacing mark</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>N</literal></entry>
|
|
<entry>Number</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Nd</literal></entry>
|
|
<entry>Decimal number</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Nl</literal></entry>
|
|
<entry>Letter number</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row rowsep="1">
|
|
<entry><literal>No</literal></entry>
|
|
<entry>Other number</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>P</literal></entry>
|
|
<entry>Punctuation</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Pc</literal></entry>
|
|
<entry>Connector punctuation</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Pd</literal></entry>
|
|
<entry>Dash punctuation</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Pe</literal></entry>
|
|
<entry>Close punctuation</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Pf</literal></entry>
|
|
<entry>Final punctuation</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Pi</literal></entry>
|
|
<entry>Initial punctuation</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Po</literal></entry>
|
|
<entry>Other punctuation</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row rowsep="1">
|
|
<entry><literal>Ps</literal></entry>
|
|
<entry>Open punctuation</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>S</literal></entry>
|
|
<entry>Symbol</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Sc</literal></entry>
|
|
<entry>Currency symbol</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Sk</literal></entry>
|
|
<entry>Modifier symbol</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Sm</literal></entry>
|
|
<entry>Mathematical symbol</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row rowsep="1">
|
|
<entry><literal>So</literal></entry>
|
|
<entry>Other symbol</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Z</literal></entry>
|
|
<entry>Separator</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Zl</literal></entry>
|
|
<entry>Line separator</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Zp</literal></entry>
|
|
<entry>Paragraph separator</entry>
|
|
<entry></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Zs</literal></entry>
|
|
<entry>Space separator</entry>
|
|
<entry></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
<para>
|
|
Extended properties such as <literal>InMusicalSymbols</literal> are not
|
|
supported by PCRE.
|
|
</para>
|
|
<para>
|
|
Specifying case-insensitive (caseless) matching does not affect these escape sequences.
|
|
For example, <literal>\p{Lu}</literal> always matches only upper case letters.
|
|
</para>
|
|
<para>
|
|
Sets of Unicode characters are defined as belonging to certain scripts. A
|
|
character from one of these sets can be matched using a script name. For
|
|
example:
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara><literal>\p{Greek}</literal></simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara><literal>\P{Han}</literal></simpara>
|
|
</listitem>
|
|
</itemizedlist>
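<para>
A brief sketch of matching by script name, again assuming Unicode property
support and the <literal>u</literal> modifier. The subject mixes three
Latin letters, a space and two Greek letters given as raw UTF-8 bytes.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// "abc", a space, then U+03B1 (alpha) and U+03B2 (beta) in UTF-8.
$text = "abc \xCE\xB1\xCE\xB2";

$greek  = preg_match_all('/\p{Greek}/u', $text, $greekChars);
$latin  = preg_match_all('/\p{Latin}/u', $text);
$notHan = preg_match_all('/\P{Han}/u', $text); // every character here is non-Han

var_dump($greek, $latin, $notHan); // int(2) int(3) int(6)
]]>
</programlisting>
</informalexample>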
|
|
<para>
|
|
Characters that are not part of an identified script are lumped together as
<literal>Common</literal>. The current list of scripts is:
|
|
</para>
|
|
<table>
|
|
<title>Supported scripts</title>
|
|
<tgroup cols="5">
|
|
<tbody>
|
|
<row>
|
|
<entry><literal>Arabic</literal></entry>
|
|
<entry><literal>Armenian</literal></entry>
|
|
<entry><literal>Avestan</literal></entry>
|
|
<entry><literal>Balinese</literal></entry>
|
|
<entry><literal>Bamum</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Batak</literal></entry>
|
|
<entry><literal>Bengali</literal></entry>
|
|
<entry><literal>Bopomofo</literal></entry>
|
|
<entry><literal>Brahmi</literal></entry>
|
|
<entry><literal>Braille</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Buginese</literal></entry>
|
|
<entry><literal>Buhid</literal></entry>
|
|
<entry><literal>Canadian_Aboriginal</literal></entry>
|
|
<entry><literal>Carian</literal></entry>
|
|
<entry><literal>Chakma</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Cham</literal></entry>
|
|
<entry><literal>Cherokee</literal></entry>
|
|
<entry><literal>Common</literal></entry>
|
|
<entry><literal>Coptic</literal></entry>
|
|
<entry><literal>Cuneiform</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Cypriot</literal></entry>
|
|
<entry><literal>Cyrillic</literal></entry>
|
|
<entry><literal>Deseret</literal></entry>
|
|
<entry><literal>Devanagari</literal></entry>
|
|
<entry><literal>Egyptian_Hieroglyphs</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Ethiopic</literal></entry>
|
|
<entry><literal>Georgian</literal></entry>
|
|
<entry><literal>Glagolitic</literal></entry>
|
|
<entry><literal>Gothic</literal></entry>
|
|
<entry><literal>Greek</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Gujarati</literal></entry>
|
|
<entry><literal>Gurmukhi</literal></entry>
|
|
<entry><literal>Han</literal></entry>
|
|
<entry><literal>Hangul</literal></entry>
|
|
<entry><literal>Hanunoo</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Hebrew</literal></entry>
|
|
<entry><literal>Hiragana</literal></entry>
|
|
<entry><literal>Imperial_Aramaic</literal></entry>
|
|
<entry><literal>Inherited</literal></entry>
|
|
<entry><literal>Inscriptional_Pahlavi</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Inscriptional_Parthian</literal></entry>
|
|
<entry><literal>Javanese</literal></entry>
|
|
<entry><literal>Kaithi</literal></entry>
|
|
<entry><literal>Kannada</literal></entry>
|
|
<entry><literal>Katakana</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Kayah_Li</literal></entry>
|
|
<entry><literal>Kharoshthi</literal></entry>
|
|
<entry><literal>Khmer</literal></entry>
|
|
<entry><literal>Lao</literal></entry>
|
|
<entry><literal>Latin</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Lepcha</literal></entry>
|
|
<entry><literal>Limbu</literal></entry>
|
|
<entry><literal>Linear_B</literal></entry>
|
|
<entry><literal>Lisu</literal></entry>
|
|
<entry><literal>Lycian</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Lydian</literal></entry>
|
|
<entry><literal>Malayalam</literal></entry>
|
|
<entry><literal>Mandaic</literal></entry>
|
|
<entry><literal>Meetei_Mayek</literal></entry>
|
|
<entry><literal>Meroitic_Cursive</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Meroitic_Hieroglyphs</literal></entry>
|
|
<entry><literal>Miao</literal></entry>
|
|
<entry><literal>Mongolian</literal></entry>
|
|
<entry><literal>Myanmar</literal></entry>
|
|
<entry><literal>New_Tai_Lue</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Nko</literal></entry>
|
|
<entry><literal>Ogham</literal></entry>
|
|
<entry><literal>Old_Italic</literal></entry>
|
|
<entry><literal>Old_Persian</literal></entry>
|
|
<entry><literal>Old_South_Arabian</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Old_Turkic</literal></entry>
|
|
<entry><literal>Ol_Chiki</literal></entry>
|
|
<entry><literal>Oriya</literal></entry>
|
|
<entry><literal>Osmanya</literal></entry>
|
|
<entry><literal>Phags_Pa</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Phoenician</literal></entry>
|
|
<entry><literal>Rejang</literal></entry>
|
|
<entry><literal>Runic</literal></entry>
|
|
<entry><literal>Samaritan</literal></entry>
|
|
<entry><literal>Saurashtra</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Sharada</literal></entry>
|
|
<entry><literal>Shavian</literal></entry>
|
|
<entry><literal>Sinhala</literal></entry>
|
|
<entry><literal>Sora_Sompeng</literal></entry>
|
|
<entry><literal>Sundanese</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Syloti_Nagri</literal></entry>
|
|
<entry><literal>Syriac</literal></entry>
|
|
<entry><literal>Tagalog</literal></entry>
|
|
<entry><literal>Tagbanwa</literal></entry>
|
|
<entry><literal>Tai_Le</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Tai_Tham</literal></entry>
|
|
<entry><literal>Tai_Viet</literal></entry>
|
|
<entry><literal>Takri</literal></entry>
|
|
<entry><literal>Tamil</literal></entry>
|
|
<entry><literal>Telugu</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Thaana</literal></entry>
|
|
<entry><literal>Thai</literal></entry>
|
|
<entry><literal>Tibetan</literal></entry>
|
|
<entry><literal>Tifinagh</literal></entry>
|
|
<entry><literal>Ugaritic</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>Vai</literal></entry>
|
|
<entry><literal>Yi</literal></entry>
|
|
<entry />
|
|
<entry />
|
|
<entry />
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
<para>
|
|
The <literal>\X</literal> escape matches any number of Unicode characters
|
|
that form an extended Unicode sequence. <literal>\X</literal> is equivalent
|
|
to <literal>(?>\PM\pM*)</literal>.
|
|
</para>
|
|
<para>
|
|
That is, it matches a character without the "mark" property, followed
|
|
by zero or more characters with the "mark" property, and treats the
|
|
sequence as an atomic group (see below). Characters with the "mark"
|
|
property are typically accents that affect the preceding character.
|
|
</para>
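<para>
As a rough illustration, <literal>\X</literal> can be used to count
user-perceived characters rather than code points. This sketch assumes
PHP 7 or later (for the <literal>\u{...}</literal> string escape) and the
<literal>u</literal> modifier, so that the subject is treated as UTF-8.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "a".
$subject = "e\u{0301}a";

// Each \X match covers a base character plus any marks that follow it.
$count = preg_match_all('/\X/u', $subject, $matches);

var_dump($count);            // int(2)
var_dump(strlen($subject));  // int(4), the length in bytes
]]>
</programlisting>
</informalexample>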
|
|
<para>
|
|
Matching characters by Unicode property is not fast, because PCRE has
|
|
to search a structure that contains data for over fifteen thousand
|
|
characters. That is why the traditional escape sequences such as
|
|
<literal>\d</literal> and <literal>\w</literal> do not use Unicode properties
|
|
in PCRE.
|
|
</para>
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.anchors">
|
|
<title>Anchors</title>
|
|
<para>
|
|
Outside a character class, in the default matching mode, the
|
|
circumflex character (<literal>^</literal>) is an assertion which
|
|
is true only if the current matching point is at the start of
|
|
the subject string. Inside a character class, circumflex (<literal>^</literal>)
|
|
has an entirely different meaning (see below).
|
|
</para>
|
|
<para>
|
|
Circumflex (<literal>^</literal>) need not be the first character
|
|
of the pattern if a number of alternatives are involved, but it
|
|
should be the first thing in each alternative in which it appears
|
|
if the pattern is ever to match that branch. If all possible
|
|
alternatives start with a circumflex (<literal>^</literal>), that is,
|
|
if the pattern is constrained to match only at the start of the subject,
|
|
it is said to be an "anchored" pattern. (There are also other
|
|
constructs that can cause a pattern to be anchored.)
|
|
</para>
|
|
<para>
|
|
A dollar character (<literal>$</literal>) is an assertion which is
|
|
true only if the current matching point is at the end of the subject
|
|
string, or immediately before a newline character that is the last
|
|
character in the string (by default). Dollar (<literal>$</literal>)
|
|
need not be the last character of the pattern if a number of
|
|
alternatives are involved, but it should be the last item in any branch
|
|
in which it appears. Dollar has no special meaning in a
|
|
character class.
|
|
</para>
|
|
<para>
|
|
The meaning of dollar can be changed so that it matches only
|
|
at the very end of the string, by setting the
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
|
|
option at compile or matching time. This does not affect the \Z assertion.
|
|
</para>
|
|
<para>
|
|
The meanings of the circumflex and dollar characters are
|
|
changed if the
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> option
|
|
is set. When this is the case, they match immediately after and
|
|
immediately before an internal "\n" character, respectively, in addition
|
|
to matching at the start and end of the subject string. For example, the
|
|
pattern /^abc$/ matches the subject string "def\nabc" in multiline mode,
|
|
but not otherwise. Consequently, patterns that are anchored in single
|
|
line mode because all branches start with "^" are not anchored in
|
|
multiline mode. The
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
|
|
option is ignored if
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> is
|
|
set.
|
|
</para>
|
|
<para>
|
|
Note that the sequences \A, \Z, and \z can be used to match
|
|
the start and end of the subject in both modes, and if all
|
|
branches of a pattern start with \A it is always anchored,
|
|
whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
|
|
is set or not.
|
|
</para>
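<para>
The following sketch shows how the anchors behave with
<function>preg_match</function>. It assumes the PHP pattern modifiers
<literal>m</literal> (PCRE_MULTILINE) and <literal>D</literal>
(PCRE_DOLLAR_ENDONLY); the subject strings are only illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = "def\nabc";

$plain     = preg_match('/^abc$/', $subject);   // 0: "abc" is not at the start
$multiline = preg_match('/^abc$/m', $subject);  // 1: ^ also matches after "\n"

// By default, $ also matches just before a final newline; D disables that.
$dollar        = preg_match('/abc$/', "abc\n");   // 1
$dollarEndOnly = preg_match('/abc$/D', "abc\n");  // 0

// \A is not affected by the m modifier.
$startOnly = preg_match('/\Aabc/m', $subject);    // 0
]]>
</programlisting>
</informalexample>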
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.dot">
|
|
<title>Dot</title>
|
|
<para>
|
|
Outside a character class, a dot in the pattern matches any
|
|
one character in the subject, including a non-printing
|
|
character, but not (by default) newline. If the
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
|
|
option is set, then dots match newlines as well. The
|
|
handling of dot is entirely independent of the handling of
|
|
circumflex and dollar, the only relationship being that they
|
|
both involve newline characters. Dot has no special meaning
|
|
in a character class.
|
|
</para>
|
|
<para>
|
|
<emphasis>\C</emphasis> can be used to match a single byte. This is mainly
relevant in <emphasis>UTF-8 mode</emphasis>, where the dot matches a whole
character, which may consist of multiple bytes.
|
|
</para>
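<para>
A minimal sketch of how the dot interacts with newlines and with UTF-8
mode, assuming the <literal>s</literal> (PCRE_DOTALL) and
<literal>u</literal> (UTF-8) pattern modifiers:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$noDotAll = preg_match('/a.b/', "a\nb");   // 0: dot does not match "\n"
$dotAll   = preg_match('/a.b/s', "a\nb");  // 1: s lets dot match "\n"

// "\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9 (e with acute).
$bytes = preg_match('/^.$/', "\xC3\xA9");   // 0: two bytes, dot matches one
$utf8  = preg_match('/^.$/u', "\xC3\xA9");  // 1: one character in UTF-8 mode
]]>
</programlisting>
</informalexample>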
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.character-classes">
|
|
<title>Character classes</title>
|
|
<para>
|
|
An opening square bracket introduces a character class,
|
|
terminated by a closing square bracket. A closing square
|
|
bracket on its own is not special. If a closing square
|
|
bracket is required as a member of the class, it should be
|
|
the first data character in the class (after an initial
|
|
circumflex, if present) or escaped with a backslash.
|
|
</para>
|
|
<para>
|
|
A character class matches a single character in the subject;
|
|
the character must be in the set of characters defined by
|
|
the class, unless the first character in the class is a
|
|
circumflex, in which case the subject character must not be in
|
|
the set defined by the class. If a circumflex is actually
|
|
required as a member of the class, ensure it is not the
|
|
first character, or escape it with a backslash.
|
|
</para>
|
|
<para>
|
|
For example, the character class [aeiou] matches any lower
|
|
case vowel, while [^aeiou] matches any character that is not
|
|
a lower case vowel. Note that a circumflex is just a
|
|
convenient notation for specifying the characters which are in
|
|
the class by enumerating those that are not. It is not an
|
|
assertion: it still consumes a character from the subject
|
|
string, and fails if the current pointer is at the end of
|
|
the string.
|
|
</para>
|
|
<para>
|
|
When case-insensitive (caseless) matching is set, any letters
|
|
in a class represent both their upper case and lower case
|
|
versions, so for example, an insensitive [aeiou] matches "A"
|
|
as well as "a", and an insensitive [^aeiou] does not match
|
|
"A", whereas a sensitive (caseful) version would.
|
|
</para>
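<para>
The effect of caseless matching on a negated class can be checked with a
short sketch such as the following, which assumes the <literal>i</literal>
pattern modifier:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$caseful  = preg_match('/[^aeiou]/', 'A');   // 1: "A" is not a lower case vowel
$caseless = preg_match('/[^aeiou]/i', 'A');  // 0: "A" counts as a vowel here

// A negated class is not an assertion: it still has to consume a character.
$empty = preg_match('/[^aeiou]/', '');       // 0
]]>
</programlisting>
</informalexample>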
|
|
<para>
|
|
The newline character is never treated in any special way in
|
|
character classes, whatever the setting of the <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
|
|
or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
|
|
options is. A class such as [^a] will always match a newline.
|
|
</para>
|
|
<para>
|
|
The minus (hyphen) character can be used to specify a range
|
|
of characters in a character class. For example, [d-m]
|
|
matches any letter between d and m, inclusive. If a minus
|
|
character is required in a class, it must be escaped with a
|
|
backslash or appear in a position where it cannot be
|
|
interpreted as indicating a range, typically as the first or last
|
|
character in the class.
|
|
</para>
|
|
<para>
|
|
It is not possible to have the literal character "]" as the
|
|
end character of a range. A pattern such as [W-]46] is
|
|
interpreted as a class of two characters ("W" and "-")
|
|
followed by a literal string "46]", so it would match "W46]" or
|
|
"-46]". However, if the "]" is escaped with a backslash it
|
|
is interpreted as the end of range, so [W-\]46] is
|
|
interpreted as a single class containing a range followed by two
|
|
separate characters. The octal or hexadecimal representation
|
|
of "]" can also be used to end a range.
|
|
</para>
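<para>
The two readings of the closing bracket can be compared as follows. This is
only a sketch; the patterns are wrapped in <literal>^</literal> and
<literal>$</literal> so that the whole subject must match.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// A class of "W" and "-", followed by the literal string "46]".
$a = preg_match('/^[W-]46]$/', 'W46]');  // 1
$b = preg_match('/^[W-]46]$/', '-46]');  // 1

// An escaped "]" ends the range W..], so the class also holds "4" and "6".
$c = preg_match('/^[W-\]46]$/', 'Z');    // 1: "Z" lies between "W" and "]"
$d = preg_match('/^[W-\]46]$/', '4');    // 1
$e = preg_match('/^[W-\]46]$/', '-');    // 0: "-" is no longer a member
]]>
</programlisting>
</informalexample>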
|
|
<para>
|
|
Ranges operate in ASCII collating sequence. They can also be
|
|
used for characters specified numerically, for example
|
|
[\000-\037]. If a range that includes letters is used when
|
|
case-insensitive (caseless) matching is set, it matches the
|
|
letters in either case. For example, [W-c] is equivalent to
|
|
[][\^_`wxyzabc], matched case-insensitively, and if character
|
|
tables for the "fr" locale are in use, [\xc8-\xcb] matches
|
|
accented E characters in both cases.
|
|
</para>
|
|
<para>
|
|
The character types \d, \D, \s, \S, \w, and \W may also
|
|
appear in a character class, and add the characters that
|
|
they match to the class. For example, [\dABCDEF] matches any
|
|
hexadecimal digit. A circumflex can conveniently be used
|
|
with the upper case character types to specify a more
|
|
restricted set of characters than the matching lower case type.
|
|
For example, the class [^\W_] matches any letter or digit,
|
|
but not underscore.
|
|
</para>
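<para>
A short sketch of character types inside classes; the subject strings are
illustrative only:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// [^\W_] keeps letters and digits but excludes the underscore.
preg_match_all('/[^\W_]+/', 'foo_bar42', $matches);
// $matches[0] is ['foo', 'bar42']

// \d adds the digits to the explicitly listed upper case letters.
$hex    = preg_match('/^[\dABCDEF]+$/', '1F40');  // 1
$notHex = preg_match('/^[\dABCDEF]+$/', '1f40');  // 0: "f" is not listed
]]>
</programlisting>
</informalexample>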
|
|
<para>
|
|
All non-alphanumeric characters other than \, -, ^ (at the
|
|
start) and the terminating ] are non-special in character
|
|
classes, but it does no harm if they are escaped. The pattern
|
|
terminator is always special and must be escaped when used
|
|
within an expression.
|
|
</para>
|
|
<para>
|
|
Perl supports the POSIX notation for character classes. This uses names
|
|
enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also
|
|
supports this notation. For example, <literal>[01[:alpha:]%]</literal>
|
|
matches "0", "1", any alphabetic character, or "%". The supported class
|
|
names are:
|
|
<table>
|
|
<title>Character classes</title>
|
|
<tgroup cols="2">
|
|
<tbody>
|
|
<row><entry><literal>alnum</literal></entry><entry>letters and digits</entry></row>
|
|
<row><entry><literal>alpha</literal></entry><entry>letters</entry></row>
|
|
<row><entry><literal>ascii</literal></entry><entry>character codes 0 - 127</entry></row>
|
|
<row><entry><literal>blank</literal></entry><entry>space or tab only</entry></row>
|
|
<row><entry><literal>cntrl</literal></entry><entry>control characters</entry></row>
|
|
<row><entry><literal>digit</literal></entry><entry>decimal digits (same as \d)</entry></row>
|
|
<row><entry><literal>graph</literal></entry><entry>printing characters, excluding space</entry></row>
|
|
<row><entry><literal>lower</literal></entry><entry>lower case letters</entry></row>
|
|
<row><entry><literal>print</literal></entry><entry>printing characters, including space</entry></row>
|
|
<row><entry><literal>punct</literal></entry><entry>printing characters, excluding letters and digits</entry></row>
|
|
<row><entry><literal>space</literal></entry><entry>white space (not quite the same as \s)</entry></row>
|
|
<row><entry><literal>upper</literal></entry><entry>upper case letters</entry></row>
|
|
<row><entry><literal>word</literal></entry><entry>"word" characters (same as \w)</entry></row>
|
|
<row><entry><literal>xdigit</literal></entry><entry>hexadecimal digits</entry></row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
The <literal>space</literal> characters are HT (9), LF (10), VT (11), FF (12), CR (13),
|
|
and space (32). Notice that this list includes the VT character (code
|
|
11). This makes "space" different to <literal>\s</literal>, which does not include VT (for
|
|
Perl compatibility).
|
|
</para>
|
|
<para>
|
|
The name <literal>word</literal> is a Perl extension, and <literal>blank</literal> is a GNU extension
|
|
from Perl 5.8. Another Perl extension is negation, which is indicated
|
|
by a <literal>^</literal> character after the colon. For example,
|
|
<literal>[12[:^digit:]]</literal> matches "1", "2", or any non-digit.
|
|
</para>
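<para>
The POSIX class notation, including negation, might be used as in this
sketch (the subject strings are illustrative):
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$ok  = preg_match('/^[01[:alpha:]%]+$/', '0a%Z1');  // 1
$bad = preg_match('/^[01[:alpha:]%]+$/', '2');      // 0

// Remove "1", "2", and every non-digit; only the other digits remain.
$digits = preg_replace('/[12[:^digit:]]/', '', 'a1b2c3');  // "3"
]]>
</programlisting>
</informalexample>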
|
|
<para>
|
|
In UTF-8 mode, characters with values greater than 128 do not match any
|
|
of the POSIX character classes.
|
|
</para>
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.alternation">
|
|
<title>Alternation</title>
|
|
<para>
|
|
Vertical bar characters are used to separate alternative
|
|
patterns. For example, the pattern
|
|
<literal>gilbert|sullivan</literal>
|
|
matches either "gilbert" or "sullivan". Any number of alternatives
|
|
may appear, and an empty alternative is permitted
|
|
(matching the empty string). The matching process tries
|
|
each alternative in turn, from left to right, and the first
|
|
one that succeeds is used. If the alternatives are within a
|
|
subpattern (defined below), "succeeds" means matching the
|
|
rest of the main pattern as well as the alternative in the
|
|
subpattern.
|
|
</para>
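<para>
Because alternatives are tried from left to right, their order can change
what is matched. A minimal sketch:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
preg_match('/gilbert|sullivan/', 'gilbert and sullivan', $m);
// $m[0] is "gilbert", the first match found in the subject

// The first alternative that succeeds is used, even if a later one is longer.
preg_match('/a|ab/', 'ab', $first);   // $first[0] is "a"
preg_match('/ab|a/', 'ab', $second);  // $second[0] is "ab"
]]>
</programlisting>
</informalexample>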
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.internal-options">
|
|
<title>Internal option setting</title>
|
|
<para>
|
|
The settings of <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link>,
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>,
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>,
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
|
|
and PCRE_DUPNAMES can be changed from within the pattern by
|
|
a sequence of Perl option letters enclosed between "(?" and
|
|
")". The option letters are:
|
|
|
|
<table>
|
|
<title>Internal option letters</title>
|
|
<tgroup cols="2">
|
|
<tbody>
|
|
<row>
|
|
<entry><literal>i</literal></entry>
|
|
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>m</literal></entry>
|
|
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>s</literal></entry>
|
|
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>x</literal></entry>
|
|
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>U</literal></entry>
|
|
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>X</literal></entry>
|
|
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>J</literal></entry>
|
|
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_DUPNAMES</link></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
</para>
|
|
<para>
|
|
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
|
|
also possible to unset these options by preceding the letter
|
|
with a hyphen, and a combined setting and unsetting such as
|
|
(?im-sx), which sets <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
|
|
while unsetting <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,
|
|
is also permitted. If a letter appears both before and after the
|
|
hyphen, the option is unset.
|
|
</para>
|
|
<para>
|
|
When an option change occurs at top level (that is, not inside
|
|
subpattern parentheses), the change applies to the remainder of the
|
|
pattern that follows. So <literal>/ab(?i)c/</literal> matches only "abc"
|
|
and "abC".
|
|
</para>
|
|
<para>
|
|
If an option change occurs inside a subpattern, the effect
|
|
is different. This is a change of behaviour in Perl 5.005.
|
|
An option change inside a subpattern affects only that part
|
|
of the subpattern that follows it, so
|
|
|
|
<literal>(a(?i)b)c</literal>
|
|
|
|
matches "abc" and "aBc" and no other strings (assuming <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not
|
|
used). By this means, options can be made to have different settings in
|
|
different parts of the pattern. Any changes made in one alternative do
|
|
carry on into subsequent branches within the same subpattern. For
|
|
example,
|
|
|
|
<literal>(a(?i)b|c)</literal>
|
|
|
|
matches "ab", "aB", "c", and "C", even though when matching
|
|
"C" the first branch is abandoned before the option setting.
|
|
This is because the effects of option settings happen at
|
|
compile time. There would be some very weird behaviour otherwise.
|
|
</para>
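<para>
The scoping rules for internal options can be checked with a sketch like
the one below. The patterns are those discussed above, with
<literal>^</literal> and <literal>$</literal> added so that the whole
subject must match.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Top level: (?i) applies to the rest of the pattern, here only "c".
$a = preg_match('/^ab(?i)c$/', 'abC');  // 1
$b = preg_match('/^ab(?i)c$/', 'aBc');  // 0

// Inside a subpattern: the change ends at the closing parenthesis.
$c = preg_match('/^(a(?i)b)c$/', 'aBc');  // 1
$d = preg_match('/^(a(?i)b)c$/', 'abC');  // 0

// The setting carries over into the next alternative of the same subpattern.
$e = preg_match('/^(a(?i)b|c)$/', 'C');   // 1
]]>
</programlisting>
</informalexample>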
|
|
<para>
|
|
The PCRE-specific options <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
|
|
be changed in the same way as the Perl-compatible options by
|
|
using the characters U and X respectively. The (?X) flag
|
|
setting is special in that it must always occur earlier in
|
|
the pattern than any of the additional features it turns on,
|
|
even when it is at top level. It is best put at the start.
|
|
</para>
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.subpatterns">
|
|
<title>Subpatterns</title>
|
|
<para>
|
|
Subpatterns are delimited by parentheses (round brackets),
|
|
which can be nested. Marking part of a pattern as a subpattern
|
|
does two things:
|
|
</para>
|
|
<orderedlist>
|
|
<listitem>
|
|
<para>
|
|
It localizes a set of alternatives. For example, the pattern
|
|
<literal>cat(aract|erpillar|)</literal> matches one of the words "cat",
|
|
"cataract", or "caterpillar". Without the parentheses, it would match
|
|
"cataract", "erpillar" or the empty string.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
It sets up the subpattern as a capturing subpattern (as defined above).
|
|
When the whole pattern matches, that portion of the subject string
|
|
that matched the subpattern is passed back to the caller via the
|
|
<emphasis>ovector</emphasis> argument of <function>pcre_exec</function>.
|
|
Opening parentheses are counted from left to right (starting from 1) to
|
|
obtain the numbers of the capturing subpatterns.
|
|
</para>
|
|
</listitem>
|
|
</orderedlist>
|
|
<para>
|
|
For example, if the string "the red king" is matched against
|
|
the pattern
|
|
|
|
<literal>the ((red|white) (king|queen))</literal>
|
|
|
|
the captured substrings are "red king", "red", and "king",
|
|
and are numbered 1, 2, and 3.
|
|
</para>
|
|
<para>
|
|
The fact that plain parentheses fulfill two functions is not
|
|
always helpful. There are often times when a grouping subpattern
|
|
is required without a capturing requirement. If an
|
|
opening parenthesis is followed by "?:", the subpattern does
|
|
not do any capturing, and is not counted when computing the
|
|
number of any subsequent capturing subpatterns. For example,
|
|
if the string "the white queen" is matched against the
|
|
pattern
|
|
|
|
<literal>the ((?:red|white) (king|queen))</literal>
|
|
|
|
the captured substrings are "white queen" and "queen", and
|
|
are numbered 1 and 2. The maximum number of captured substrings
|
|
is 99, and the maximum number of all subpatterns,
|
|
both capturing and non-capturing, is 200.
|
|
</para>
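<para>
In PHP the captured substrings are returned in the matches array of
<function>preg_match</function>, with the whole match at index 0. A short
sketch using the two patterns above:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
preg_match('/the ((red|white) (king|queen))/', 'the red king', $capturing);
// $capturing is ['the red king', 'red king', 'red', 'king']

// (?:...) groups without capturing, so "queen" becomes group 2.
preg_match('/the ((?:red|white) (king|queen))/', 'the white queen', $mixed);
// $mixed is ['the white queen', 'white queen', 'queen']
]]>
</programlisting>
</informalexample>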
|
|
<para>
|
|
As a convenient shorthand, if any option settings are
|
|
required at the start of a non-capturing subpattern, the
|
|
option letters may appear between the "?" and the ":". Thus
|
|
the two patterns
|
|
</para>
|
|
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
|
|
(?i:saturday|sunday)
|
|
(?:(?i)saturday|sunday)
|
|
]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
|
|
<para>
|
|
match exactly the same set of strings. Because alternative
|
|
branches are tried from left to right, and options are not
|
|
reset until the end of the subpattern is reached, an option
|
|
setting in one branch does affect subsequent branches, so
|
|
the above patterns match "SUNDAY" as well as "Saturday".
|
|
</para>
|
|
|
|
<para>
|
|
It is possible to name a subpattern using the syntax
<literal>(?P&lt;name&gt;pattern)</literal>. This subpattern will then
be indexed in the matches array by its normal numeric position and
also by name. PHP 5.2.2 introduced two alternative syntaxes:
<literal>(?&lt;name&gt;pattern)</literal> and <literal>(?'name'pattern)</literal>.
|
|
</para>
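<para>
A minimal sketch of named subpatterns; the names
<literal>year</literal> and <literal>month</literal> and the subject string
are made up for illustration:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
preg_match('/(?P<year>\d{4})-(?<month>\d{2})/', 'Released 2024-05', $m);

$byName  = $m['year'];   // "2024"
$byIndex = $m[1];        // "2024", the same capture by numeric position
$month   = $m['month'];  // "05", also available as $m[2]
]]>
</programlisting>
</informalexample>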
|
|
|
|
<para>
|
|
Sometimes it is necessary to have multiple matching, but alternating
|
|
subgroups in a regular expression. Normally, each of these would be given
|
|
their own backreference number even though only one of them would ever
|
|
possibly match. To overcome this, the <literal>(?|</literal> syntax allows
|
|
having duplicate numbers. Consider the following regex matched against the
|
|
string <literal>Sunday</literal>:
|
|
</para>
|
|
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[(?:(Sat)ur|(Sun))day]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
|
|
<para>
|
|
Here <literal>Sun</literal> is stored in backreference 2, while
backreference 1 is empty. Matching against <literal>Saturday</literal>
instead yields <literal>Sat</literal> in backreference 1, while
backreference 2 does not exist. Changing the pattern to use
<literal>(?|</literal> fixes this problem:
|
|
</para>
|
|
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[(?|(Sat)ur|(Sun))day]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
|
|
<para>
|
|
Using this pattern, both <literal>Sun</literal> and <literal>Sat</literal>
|
|
would be stored in backreference 1.
|
|
</para>
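<para>
The difference between the two patterns can be observed in the matches
array, as in this sketch:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Without branch reset, the two groups are numbered 1 and 2.
preg_match('/(?:(Sat)ur|(Sun))day/', 'Sunday', $plain);
// $plain is ['Sunday', '', 'Sun']

// With (?|...), each alternative numbers its group 1.
preg_match('/(?|(Sat)ur|(Sun))day/', 'Sunday', $sun);    // $sun[1] is "Sun"
preg_match('/(?|(Sat)ur|(Sun))day/', 'Saturday', $sat);  // $sat[1] is "Sat"
]]>
</programlisting>
</informalexample>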
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.repetition">
|
|
<title>Repetition</title>
|
|
<para>
|
|
Repetition is specified by quantifiers, which can follow any
|
|
of the following items:
|
|
|
|
<itemizedlist>
|
|
<listitem><simpara>a single character, possibly escaped</simpara></listitem>
|
|
<listitem><simpara>the . metacharacter</simpara></listitem>
|
|
<listitem><simpara>a character class</simpara></listitem>
|
|
<listitem><simpara>a back reference (see next section)</simpara></listitem>
|
|
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
|
|
see below)</simpara></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
<para>
|
|
The general repetition quantifier specifies a minimum and
|
|
maximum number of permitted matches, by giving the two
|
|
numbers in curly brackets (braces), separated by a comma.
|
|
The numbers must be less than 65536, and the first must be
|
|
less than or equal to the second. For example:
|
|
|
|
<literal>z{2,4}</literal>
|
|
|
|
matches "zz", "zzz", or "zzzz". A closing brace on its own
|
|
is not a special character. If the second number is omitted,
|
|
but the comma is present, there is no upper limit; if the
|
|
second number and the comma are both omitted, the quantifier
|
|
specifies an exact number of required matches. Thus
|
|
|
|
<literal>[aeiou]{3,}</literal>
|
|
|
|
matches at least 3 successive vowels, but may match many
|
|
more, while
|
|
|
|
<literal>\d{8}</literal>
|
|
|
|
matches exactly 8 digits. An opening curly bracket that
|
|
appears in a position where a quantifier is not allowed, or
|
|
one that does not match the syntax of a quantifier, is taken
|
|
as a literal character. For example, {,6} is not a quantifier,
|
|
but a literal string of four characters.
|
|
</para>
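<para>
The general quantifier forms can be tried out as follows. This is a sketch
with illustrative subjects; the patterns are anchored so that the whole
subject must match.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$a = preg_match('/^z{2,4}$/', 'zzz');         // 1
$b = preg_match('/^z{2,4}$/', 'zzzzz');       // 0: more than four
$c = preg_match('/^[aeiou]{3,}$/', 'aeiou');  // 1: no upper limit
$d = preg_match('/^\d{8}$/', '20240131');     // 1: exactly eight digits
$e = preg_match('/^\d{8}$/', '2024013');      // 0: only seven digits
]]>
</programlisting>
</informalexample>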
|
|
<para>
|
|
The quantifier {0} is permitted, causing the expression to
|
|
behave as if the previous item and the quantifier were not
|
|
present.
|
|
</para>
|
|
<para>
|
|
For convenience (and historical compatibility) the three
|
|
most common quantifiers have single-character abbreviations:
|
|
|
|
<table>
|
|
<title>Single-character quantifiers</title>
|
|
<tgroup cols="2">
|
|
<tbody>
|
|
<row>
|
|
<entry><literal>*</literal></entry>
|
|
<entry>equivalent to <literal>{0,}</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>+</literal></entry>
|
|
<entry>equivalent to <literal>{1,}</literal></entry>
|
|
</row>
|
|
<row>
|
|
<entry><literal>?</literal></entry>
|
|
<entry>equivalent to <literal>{0,1}</literal></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
</para>
|
|
<para>
|
|
It is possible to construct infinite loops by following a
|
|
subpattern that can match no characters with a quantifier
|
|
that has no upper limit, for example:
|
|
|
|
<literal>(a?)*</literal>
|
|
</para>
|
|
<para>
|
|
Earlier versions of Perl and PCRE used to give an error at
|
|
compile time for such patterns. However, because there are
|
|
cases where this can be useful, such patterns are now
|
|
accepted, but if any repetition of the subpattern does in
|
|
fact match no characters, the loop is forcibly broken.
|
|
</para>
|
|
<para>
|
|
By default, the quantifiers are "greedy", that is, they
|
|
match as much as possible (up to the maximum number of permitted
|
|
times), without causing the rest of the pattern to
|
|
fail. The classic example of where this gives problems is in
|
|
trying to match comments in C programs. These appear between
|
|
the sequences /* and */ and within the sequence, individual
|
|
* and / characters may appear. An attempt to match C comments
|
|
by applying the pattern
|
|
|
|
<literal>/\*.*\*/</literal>
|
|
|
|
to the string
|
|
|
|
<literal>/* first comment */ not comment /* second comment */</literal>
|
|
|
|
fails, because it matches the entire string due to the
|
|
greediness of the .* item.
|
|
</para>
|
|
<para>
|
|
However, if a quantifier is followed by a question mark,
|
|
then it becomes lazy, and instead matches the minimum
|
|
number of times possible, so the pattern
|
|
|
|
<literal>/\*.*?\*/</literal>
|
|
|
|
does the right thing with the C comments. The meaning of the
|
|
various quantifiers is not otherwise changed, just the preferred
|
|
number of matches. Do not confuse this use of
|
|
question mark with its use as a quantifier in its own right.
|
|
Because it has two uses, it can sometimes appear doubled, as
|
|
in
|
|
|
|
<literal>\d??\d</literal>
|
|
|
|
which matches one digit by preference, but can match two if
|
|
that is the only way the rest of the pattern matches.
|
|
</para>
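<para>
The C comment example can be reproduced with the sketch below. The
<literal>#</literal> delimiter is used here only so that the slashes in the
pattern need no escaping.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$source = '/* first comment */ not comment /* second comment */';

preg_match('#/\*.*\*/#', $source, $greedy);
// $greedy[0] is the entire string

preg_match('#/\*.*?\*/#', $source, $lazy);
// $lazy[0] is "/* first comment */"

// A lazy "?" prefers to match zero times, so only one digit is taken.
preg_match('/\d??\d/', '12', $digit);
// $digit[0] is "1"
]]>
</programlisting>
</informalexample>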
|
|
<para>
|
|
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>
|
|
option is set (an option which is not
|
|
available in Perl) then the quantifiers are not greedy by
|
|
default, but individual ones can be made greedy by following
|
|
them with a question mark. In other words, it inverts the
|
|
default behaviour.
|
|
</para>
|
|
<para>
|
|
Quantifiers followed by <literal>+</literal> are "possessive". They consume
as many characters as possible and never give any back (backtrack) to let
the rest of the pattern match. Thus <literal>.*abc</literal> matches "aabc" but
<literal>.*+abc</literal> does not, because <literal>.*+</literal> consumes the
whole string. Possessive quantifiers can be used to speed up processing.
|
|
</para>
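<para>
A minimal sketch contrasting a greedy and a possessive quantifier on the
same subject:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Greedy: .* first takes everything, then gives characters back for "abc".
$greedy = preg_match('/.*abc/', 'aabc');       // 1

// Possessive: .*+ keeps everything it took, so "abc" can never match.
$possessive = preg_match('/.*+abc/', 'aabc');  // 0
]]>
</programlisting>
</informalexample>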
|
|
<para>
|
|
When a parenthesized subpattern is quantified with a minimum
|
|
repeat count that is greater than 1 or with a limited maximum,
|
|
more memory is required for the compiled pattern, in
|
|
proportion to the size of the minimum or maximum.
|
|
</para>
|
|
<para>
|
|
If a pattern starts with .* or .{0,} and the <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
|
|
option (equivalent to Perl's /s) is set, thus allowing the .
|
|
to match newlines, then the pattern is implicitly anchored,
|
|
because whatever follows will be tried against every character
|
|
position in the subject string, so there is no point in
|
|
retrying the overall match at any position after the first.
|
|
PCRE treats such a pattern as though it were preceded by \A.
|
|
In cases where it is known that the subject string contains
|
|
no newlines, it is worth setting <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
|
|
pattern begins with .* in order to
|
|
obtain this optimization, or
|
|
alternatively using ^ to indicate anchoring explicitly.
|
|
</para>
|
|
<para>
|
|
When a capturing subpattern is repeated, the value captured
|
|
is the substring that matched the final iteration. For example, after
|
|
|
|
<literal>(tweedle[dume]{3}\s*)+</literal>
|
|
|
|
has matched "tweedledum tweedledee" the value of the captured
|
|
substring is "tweedledee". However, if there are
|
|
nested capturing subpatterns, the corresponding captured
|
|
values may have been set in previous iterations. For example,
|
|
after
|
|
|
|
<literal>/(a|(b))+/</literal>
|
|
|
|
matches "aba" the value of the second captured substring is
|
|
"b".
|
|
</para>
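<para>
The values left in repeated capturing subpatterns can be inspected with a
sketch such as this one:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
preg_match('/(tweedle[dume]{3}\s*)+/', 'tweedledum tweedledee', $m);
// $m[1] is "tweedledee", the substring matched by the final iteration

preg_match('/(a|(b))+/', 'aba', $n);
// $n[1] is "a" (final iteration); $n[2] is still "b" from the second one
]]>
</programlisting>
</informalexample>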
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.back-references">
|
|
<title>Back references</title>
|
|
<para>
|
|
Outside a character class, a backslash followed by a digit
|
|
greater than 0 (and possibly further digits) is a back
|
|
reference to a capturing subpattern earlier (i.e. to its
|
|
left) in the pattern, provided there have been that many
|
|
previous capturing left parentheses.
|
|
</para>
|
|
<para>
|
|
However, if the decimal number following the backslash is
|
|
less than 10, it is always taken as a back reference, and
|
|
causes an error only if there are not that many capturing
|
|
left parentheses in the entire pattern. In other words, the
|
|
parentheses that are referenced need not be to the left of
|
|
the reference for numbers less than 10.
|
|
A "forward back reference" can make sense when a repetition
|
|
is involved and the subpattern to the right has participated
|
|
in an earlier iteration. See the section
|
|
entitled "Backslash" above for further details of the handling
|
|
of digits following a backslash.
|
|
</para>
|
|
<para>
|
|
A back reference matches whatever actually matched the capturing
|
|
subpattern in the current subject string, rather than
|
|
anything matching the subpattern itself. So the pattern
|
|
|
|
<literal>(sens|respons)e and \1ibility</literal>
|
|
|
|
matches "sense and sensibility" and "response and responsibility",
|
|
but not "sense and responsibility". If case-sensitive (caseful)
|
|
matching is in force at the time of the back reference, then
|
|
the case of letters is relevant. For example,
|
|
|
|
<literal>((?i)rah)\s+\1</literal>
|
|
|
|
matches "rah rah" and "RAH RAH", but not "RAH rah", even
|
|
though the original capturing subpattern is matched
|
|
case-insensitively (caselessly).
|
|
</para>
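<para>
A minimal sketch of both behaviours, using <function>preg_match</function>;
the variable names are illustrative only.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// \1 must repeat the text that group 1 actually matched.
$pattern  = '/(sens|respons)e and \1ibility/';
$sameWord = preg_match($pattern, 'sense and sensibility');
$mixed    = preg_match($pattern, 'sense and responsibility');
var_dump($sameWord, $mixed); // int(1), int(0)

// The group is matched caselessly, but the back reference is caseful.
$rah       = '/((?i)rah)\s+\1/';
$upperBoth = preg_match($rah, 'RAH RAH');
$upperThenLower = preg_match($rah, 'RAH rah');
var_dump($upperBoth, $upperThenLower); // int(1), int(0)
?>
]]>
</programlisting>
</informalexample>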
|
|
<para>
|
|
There may be more than one back reference to the same subpattern.
|
|
If a subpattern has not actually been used in a
|
|
particular match, then any back references to it always
|
|
fail. For example, the pattern
|
|
|
|
<literal>(a|(bc))\2</literal>
|
|
|
|
always fails if it starts to match "a" rather than "bc".
|
|
Because there may be up to 99 back references, all digits
|
|
following the backslash are taken as part of a potential
|
|
back reference number. If the pattern continues with a digit
|
|
character, then some delimiter must be used to terminate the
|
|
back reference. If the <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
|
|
is set, this can be whitespace. Otherwise an empty comment can be used.
|
|
</para>
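<para>
As a sketch of the two terminators just described, the following patterns
both mean "group 1, a back reference to it, then the literal digit 1". The
subject string and variable names are chosen for illustration only.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// An empty comment separates the back reference from the digit.
$withComment = preg_match('/(a)\1(?#)1/', 'aa1');

// Under the x modifier, whitespace does the same job.
$withExtended = preg_match('/(a)\1 1/x', 'aa1');

var_dump($withComment, $withExtended); // int(1), int(1)
?>
]]>
</programlisting>
</informalexample>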
|
|
<para>
|
|
A back reference that occurs inside the parentheses to which
|
|
it refers fails when the subpattern is first used, so, for
|
|
example, (a\1) never matches. However, such references can
|
|
be useful inside repeated subpatterns. For example, the pattern
|
|
|
|
<literal>(a|b\1)+</literal>
|
|
|
|
matches any number of "a"s and also "aba", "ababba" etc. At
|
|
each iteration of the subpattern, the back reference matches
|
|
the character string corresponding to the previous iteration.
|
|
In order for this to work, the pattern must be such
|
|
that the first iteration does not need to match the back
|
|
reference. This can be done using alternation, as in the
|
|
example above, or by a quantifier with a minimum of zero.
|
|
</para>
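<para>
The following sketch anchors the pattern at both ends (an addition made
here so that the whole subject has to be consumed) and shows how each
iteration reuses the text of the previous one.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$pattern = '/^(a|b\1)+$/';

// Iterations: "a", then "b" followed by the previous "a".
$short = preg_match($pattern, 'aba');

// Iterations: "a", "ba", "bba"; group 1 ends up holding "bba".
$long = preg_match($pattern, 'ababba', $m);

// After "a", the next iteration would need "ba", not "bb".
$broken = preg_match($pattern, 'abb');

var_dump($short, $long, $broken); // int(1), int(1), int(0)
var_dump($m[1]);                  // string(3) "bba"
?>
]]>
</programlisting>
</informalexample>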
|
|
<para>
|
|
As of PHP 5.2.2, the <literal>\g</literal> escape sequence can be
|
|
used for absolute and relative referencing of subpatterns.
|
|
This escape sequence must be followed by an unsigned number or a negative
|
|
number, optionally enclosed in braces. The sequences <literal>\1</literal>,
|
|
<literal>\g1</literal> and <literal>\g{1}</literal> are synonymous
|
|
with one another. Using this sequence with an unsigned number
removes the ambiguity inherent in digits following a
backslash: it distinguishes back references from octal
character codes and also makes it easier to have a back reference followed
by a literal digit, e.g. <literal>\g{2}1</literal>.
|
|
</para>
|
|
<para>
|
|
The use of the <literal>\g</literal> sequence with a negative number
|
|
signifies a relative reference. For example, <literal>(foo)(bar)\g{-1}</literal>
|
|
would match the sequence "foobarbar" and <literal>(foo)(bar)\g{-2}</literal>
|
|
matches "foobarfoo". This can be useful in long patterns as an alternative
|
|
to keeping track of the number of subpatterns in order to reference
|
|
a specific previous subpattern.
|
|
</para>
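<para>
A brief sketch of absolute and relative <literal>\g</literal> references,
assuming PHP 5.2.2 or later; the variable names are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// \g{-1} is the most recently opened group, here group 2.
$lastGroup = preg_match('/(foo)(bar)\g{-1}/', 'foobarbar');

// \g{-2} is the group before that, here group 1.
$firstGroup = preg_match('/(foo)(bar)\g{-2}/', 'foobarfoo');

// The braces end the reference, so the trailing 1 is a literal digit.
$thenDigit = preg_match('/(foo)(bar)\g{2}1/', 'foobarbar1');

var_dump($lastGroup, $firstGroup, $thenDigit); // int(1), int(1), int(1)
?>
]]>
</programlisting>
</informalexample>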
|
|
<para>
|
|
Back references to the named subpatterns can be achieved by
|
|
<literal>(?P=name)</literal> or, since PHP 5.2.2, also by
|
|
<literal>\k&lt;name&gt;</literal> or <literal>\k'name'</literal>.
|
|
Additionally PHP 5.2.4 added support for <literal>\k{name}</literal>
|
|
and <literal>\g{name}</literal>.
|
|
</para>
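<para>
The sketch below runs each of the named forms against the same subject,
assuming PHP 5.2.4 or later. The group name <literal>word</literal> and the
array names are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = 'hello hello';
$forms = array(
    '/(?P<word>\w+) (?P=word)/',
    '/(?P<word>\w+) \k<word>/',
    "/(?P<word>\\w+) \\k'word'/",
    '/(?P<word>\w+) \k{word}/',
    '/(?P<word>\w+) \g{word}/',
);

$results = array();
foreach ($forms as $pattern) {
    // Each form refers back to whatever the group named "word" matched.
    $results[] = preg_match($pattern, $subject);
}
var_dump($results === array(1, 1, 1, 1, 1)); // bool(true)
?>
]]>
</programlisting>
</informalexample>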
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.assertions">
|
|
<title>Assertions</title>
|
|
<para>
|
|
An assertion is a test on the characters following or
|
|
preceding the current matching point that does not actually
|
|
consume any characters. The simple assertions coded as \b,
|
|
\B, \A, \Z, \z, ^ and $ are described above. More complicated
|
|
assertions are coded as subpatterns. There are two
|
|
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
|
|
subject string, and those that <emphasis>look behind</emphasis> it.
|
|
</para>
|
|
<para>
|
|
An assertion subpattern is matched in the normal way, except
|
|
that it does not cause the current matching position to be
|
|
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
|
|
assertions and (?! for negative assertions. For example,
|
|
|
|
<literal>\w+(?=;)</literal>
|
|
|
|
matches a word followed by a semicolon, but does not include
|
|
the semicolon in the match, and
|
|
|
|
<literal>foo(?!bar)</literal>
|
|
|
|
matches any occurrence of "foo" that is not followed by
|
|
"bar". Note that the apparently similar pattern
|
|
|
|
<literal>(?!foo)bar</literal>
|
|
|
|
does not find an occurrence of "bar" that is preceded by
|
|
something other than "foo"; it finds any occurrence of "bar"
|
|
whatsoever, because the assertion (?!foo) is always &true;
|
|
when the next three characters are "bar". A lookbehind
|
|
assertion is needed to achieve this effect.
|
|
</para>
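<para>
A short sketch of the three lookahead patterns discussed above; the
subject strings and variable names are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// The semicolon is required but is not part of the match.
preg_match('/\w+(?=;)/', 'return value;', $m);
var_dump($m[0]); // string(5) "value"

// "foo" not followed by "bar".
$fooThenBar = preg_match('/foo(?!bar)/', 'foobar');
$fooThenBaz = preg_match('/foo(?!bar)/', 'foobaz');
var_dump($fooThenBar, $fooThenBaz); // int(0), int(1)

// The lookahead is tested where "bar" starts, so it always succeeds.
$lookaheadFirst = preg_match('/(?!foo)bar/', 'foobar');
var_dump($lookaheadFirst); // int(1)
?>
]]>
</programlisting>
</informalexample>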
|
|
<para>
|
|
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
and (?&lt;! for negative assertions. For example,

<literal>(?&lt;!foo)bar</literal>

does find an occurrence of "bar" that is not preceded by
"foo". The contents of a lookbehind assertion are restricted
such that all the strings it matches must have a fixed
length. However, if there are several alternatives, they do
not all have to have the same fixed length. Thus

<literal>(?&lt;=bullock|donkey)</literal>

is permitted, but

<literal>(?&lt;!dogs?|cats?)</literal>

causes an error at compile time. Branches that match different
length strings are permitted only at the top level of
a lookbehind assertion. This is an extension compared with
Perl 5.005, which requires all branches to match the same
length of string. An assertion such as

<literal>(?&lt;=ab(c|de))</literal>

is not permitted, because its single top-level branch can
match two different lengths, but it is acceptable if rewritten
to use two top-level branches:

<literal>(?&lt;=abc|abde)</literal>

The implementation of lookbehind assertions is, for each
alternative, to temporarily move the current position back
by the fixed width and then try to match. If there are
insufficient characters before the current position, the
match is deemed to fail. Lookbehinds in conjunction with
once-only subpatterns can be particularly useful for matching
at the ends of strings; an example is given at the end
of the section on once-only subpatterns.
|
|
</para>
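<para>
The permitted forms can be sketched as follows. Only patterns that the
text above describes as valid are used; the subjects and variable names
are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$notAfterFoo = '/(?<!foo)bar/';
$afterFoo = preg_match($notAfterFoo, 'foobar');
$afterBaz = preg_match($notAfterFoo, 'bazbar');
// Too few characters precede "bar", so the lookbehind cannot match
// and the negative assertion succeeds.
$atStart = preg_match($notAfterFoo, 'bar');
var_dump($afterFoo, $afterBaz, $atStart); // int(0), int(1), int(1)

// Top-level alternatives may have different fixed lengths (7 and 6).
$plural   = '/(?<=bullock|donkey)s/';
$bullocks = preg_match($plural, 'bullocks');
$donkeys  = preg_match($plural, 'donkeys');
var_dump($bullocks, $donkeys); // int(1), int(1)
?>
]]>
</programlisting>
</informalexample>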
|
|
<para>
|
|
Several assertions (of any sort) may occur in succession.
For example,

<literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>

matches "foo" preceded by three digits that are not "999".
Notice that each of the assertions is applied independently
at the same point in the subject string. First there is a
check that the previous three characters are all digits,
then there is a check that the same three characters are not
"999". This pattern does not match "foo" preceded by six
characters, the first three of which are digits and the last three
of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

<literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
|
|
</para>
|
|
<para>
|
|
This time the first assertion looks at the preceding six
|
|
characters, checking that the first three are digits, and
|
|
then the second assertion checks that the preceding three
|
|
characters are not "999".
|
|
</para>
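<para>
The difference between the two patterns can be sketched like this; the
subject strings and variable names are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Both assertions inspect the same three preceding characters.
$adjacent = '/(?<=\d{3})(?<!999)foo/';
$a1 = preg_match($adjacent, '123foo');
$a2 = preg_match($adjacent, '999foo');
$a3 = preg_match($adjacent, '123abcfoo');
var_dump($a1, $a2, $a3); // int(1), int(0), int(0)

// The first assertion now looks back six characters, the second three.
$spaced = '/(?<=\d{3}...)(?<!999)foo/';
$s1 = preg_match($spaced, '123abcfoo');
$s2 = preg_match($spaced, '123999foo');
var_dump($s1, $s2); // int(1), int(0)
?>
]]>
</programlisting>
</informalexample>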
|
|
<para>
|
|
Assertions can be nested in any combination. For example,

<literal>(?&lt;=(?&lt;!foo)bar)baz</literal>

matches an occurrence of "baz" that is preceded by "bar"
which in turn is not preceded by "foo", while

<literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>

is another pattern which matches "foo" preceded by three
digits and any three characters that are not "999".
|
|
</para>
|
|
<para>
|
|
Assertion subpatterns are not capturing subpatterns, and may
|
|
not be repeated, because it makes no sense to assert the
|
|
same thing several times. If any kind of assertion contains
|
|
capturing subpatterns within it, these are counted for the
|
|
purposes of numbering the capturing subpatterns in the whole
|
|
pattern. However, substring capturing is carried out only
|
|
for positive assertions, because it does not make sense for
|
|
negative assertions.
|
|
</para>
|
|
<para>
|
|
Assertions count towards the maximum of 200 parenthesized
|
|
subpatterns.
|
|
</para>
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.onlyonce">
|
|
<title>Once-only subpatterns</title>
|
|
<para>
|
|
With both maximizing and minimizing repetition, failure of
|
|
what follows normally causes the repeated item to be
|
|
re-evaluated to see if a different number of repeats allows the
|
|
rest of the pattern to match. Sometimes it is useful to
|
|
prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.
|
|
</para>
|
|
<para>
|
|
Consider, for example, the pattern \d+foo when applied to
|
|
the subject line
|
|
|
|
<literal>123456bar</literal>
|
|
</para>
|
|
<para>
|
|
After matching all 6 digits and then failing to match "foo",
|
|
the normal action of the matcher is to try again with only 5
|
|
digits matching the \d+ item, and then with 4, and so on,
|
|
before ultimately failing. Once-only subpatterns provide the
|
|
means for specifying that once a portion of the pattern has
|
|
matched, it is not to be re-evaluated in this way, so the
|
|
matcher would give up immediately on failing to match "foo"
|
|
the first time. The notation is another kind of special
|
|
parenthesis, starting with (?&gt; as in this example:

<literal>(?&gt;\d+)foo</literal>
|
|
</para>
|
|
<para>
|
|
This kind of parenthesis "locks up" the part of the pattern
|
|
it contains once it has matched, and a failure further into
|
|
the pattern is prevented from backtracking into it.
|
|
Backtracking past it to previous items, however, works as normal.
|
|
</para>
|
|
<para>
|
|
An alternative description is that a subpattern of this type
|
|
matches the string of characters that an identical standalone
|
|
pattern would match, if anchored at the current point
|
|
in the subject string.
|
|
</para>
|
|
<para>
|
|
Once-only subpatterns are not capturing subpatterns. Simple
|
|
cases such as the above example can be thought of as a maximizing
|
|
repeat that must swallow everything it can. So,
|
|
while both \d+ and \d+? are prepared to adjust the number of
|
|
digits they match in order to make the rest of the pattern
|
|
match, (?>\d+) can only match an entire sequence of digits.
|
|
</para>
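<para>
Because a failing match looks the same with or without a once-only
subpattern, the sketch below uses a trailing <literal>\d</literal>
(an illustrative variation, not taken from the text above) to make the
difference in behaviour visible.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// An ordinary greedy repeat gives back one digit so that the final \d
// can still match.
$greedy = preg_match('/\d+\d/', '123');

// The once-only group keeps every digit it took, so nothing is left
// for the final \d at any starting position.
$onceOnly = preg_match('/(?>\d+)\d/', '123');

var_dump($greedy, $onceOnly); // int(1), int(0)

// When the rest of the pattern fits, the group behaves like \d+.
$found   = preg_match('/(?>\d+)foo/', '123456foo');
$missing = preg_match('/(?>\d+)foo/', '123456bar');
var_dump($found, $missing); // int(1), int(0)
?>
]]>
</programlisting>
</informalexample>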
|
|
<para>
|
|
This construction can of course contain arbitrarily complicated
|
|
subpatterns, and it can be nested.
|
|
</para>
|
|
<para>
|
|
Once-only subpatterns can be used in conjunction with
|
|
lookbehind assertions to specify efficient matching at the end
|
|
of the subject string. Consider a simple pattern such as
|
|
|
|
<literal>abcd$</literal>
|
|
|
|
when applied to a long string which does not match. Because
|
|
matching proceeds from left to right, PCRE will look for
|
|
each "a" in the subject and then see if what follows matches
|
|
the rest of the pattern. If the pattern is specified as
|
|
|
|
<literal>^.*abcd$</literal>
|
|
|
|
then the initial .* matches the entire string at first, but
|
|
when this fails (because there is no following "a"), it
|
|
backtracks to match all but the last character, then all but
|
|
the last two characters, and so on. Once again the search
|
|
for "a" covers the entire string, from right to left, so we
|
|
are no better off. However, if the pattern is written as
|
|
|
|
<literal>^(?&gt;.*)(?&lt;=abcd)</literal>
|
|
|
|
then there can be no backtracking for the .* item; it can
|
|
match only the entire string. The subsequent lookbehind
|
|
assertion does a single test on the last four characters. If
|
|
it fails, the match fails immediately. For long strings,
|
|
this approach makes a significant difference to the processing time.
|
|
</para>
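<para>
A minimal sketch of the end-of-string test, assuming single-line subjects
(without the DOTALL option the dot stops at a newline). The variable names
are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$endsWithAbcd = '/^(?>.*)(?<=abcd)/';

// The dot-star swallows the whole line once, then the lookbehind
// inspects only the last four characters.
$atEnd   = preg_match($endsWithAbcd, 'xxxxabcd');
$atStart = preg_match($endsWithAbcd, 'abcdxxxx');

var_dump($atEnd, $atStart); // int(1), int(0)
?>
]]>
</programlisting>
</informalexample>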
|
|
<para>
|
|
When a pattern contains an unlimited repeat inside a subpattern
|
|
that can itself be repeated an unlimited number of
|
|
times, the use of a once-only subpattern is the only way to
|
|
avoid some failing matches taking a very long time indeed.
|
|
The pattern
|
|
|
|
<literal>(\D+|&lt;\d+&gt;)*[!?]</literal>

matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in &lt;&gt;, followed by
either ! or ?. When it matches, it runs quickly. However, if
|
|
it is applied to
|
|
|
|
<literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>
|
|
|
|
it takes a long time before reporting failure. This is
|
|
because the string can be divided between the two repeats in
|
|
a large number of ways, and all have to be tried. (The example
|
|
used [!?] rather than a single character at the end,
|
|
because both PCRE and Perl have an optimization that allows
|
|
for fast failure when a single character is used. They
|
|
remember the last single character that is required for a
|
|
match, and fail early if it is not present in the string.)
|
|
If the pattern is changed to
|
|
|
|
<literal>((?&gt;\D+)|&lt;\d+&gt;)*[!?]</literal>
|
|
|
|
sequences of non-digits cannot be broken, and failure happens quickly.
|
|
</para>
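<para>
The once-only form can be tried as sketched below. Only return values are
shown, and the subject strings are illustrative. The form without the
once-only group is deliberately not run here: depending on the PCRE
version and the <literal>pcre.backtrack_limit</literal> setting it may be
slow, or <function>preg_match</function> may return &false;.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$pattern = '/((?>\D+)|<\d+>)*[!?]/';

// A subject that contains a terminating "!" still matches.
$matches = preg_match($pattern, 'abc<123>def!');

// A long run of "a" with no "!" or "?" is rejected, because each run
// of non-digits can be consumed in only one way.
$fails = preg_match($pattern, str_repeat('a', 52));

var_dump($matches, $fails); // int(1), int(0)
?>
]]>
</programlisting>
</informalexample>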
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.conditional">
|
|
<title>Conditional subpatterns</title>
|
|
<para>
|
|
It is possible to cause the matching process to obey a subpattern
|
|
conditionally or to choose between two alternative
|
|
subpatterns, depending on the result of an assertion, or
|
|
whether a previous capturing subpattern matched or not. The
|
|
two possible forms of conditional subpattern are
|
|
</para>
|
|
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
|
|
(?(condition)yes-pattern)
|
|
(?(condition)yes-pattern|no-pattern)
|
|
]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
<para>
|
|
If the condition is satisfied, the yes-pattern is used; otherwise
|
|
the no-pattern (if present) is used. If there are
|
|
more than two alternatives in the subpattern, a compile-time
|
|
error occurs.
|
|
</para>
|
|
<para>
|
|
There are three kinds of condition. If the text between the
|
|
parentheses consists of a sequence of digits, then the
|
|
condition is satisfied if the capturing subpattern of that
|
|
number has previously matched. Consider the following pattern,
|
|
which contains non-significant white space to make it
|
|
more readable (assume the <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
|
|
option) and to divide it into three parts for ease of discussion:
|
|
|
|
<literal>( \( )? [^()]+ (?(1) \) )</literal>
|
|
</para>
|
|
<para>
|
|
The first part matches an optional opening parenthesis, and
|
|
if that character is present, sets it as the first captured
|
|
substring. The second part matches one or more characters
|
|
that are not parentheses. The third part is a conditional
|
|
subpattern that tests whether the first set of parentheses
|
|
matched or not. If they did, that is, if the subject started
|
|
with an opening parenthesis, the condition is &true;, and so
|
|
the yes-pattern is executed and a closing parenthesis is
|
|
required. Otherwise, since no-pattern is not present, the
|
|
subpattern matches nothing. In other words, this pattern
|
|
matches a sequence of non-parentheses, optionally enclosed
|
|
in parentheses.
|
|
</para>
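<para>
The pattern can be exercised as sketched below. The
<literal>^</literal> and <literal>$</literal> anchors are an addition made
here so that the whole subject is tested; the variable names are
illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// The x modifier makes the white space in the pattern non-significant.
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

$wrapped   = preg_match($pattern, '(abc)');
$bare      = preg_match($pattern, 'abc');
$openOnly  = preg_match($pattern, '(abc');
$closeOnly = preg_match($pattern, 'abc)');

var_dump($wrapped, $bare);       // int(1), int(1)
var_dump($openOnly, $closeOnly); // int(0), int(0)
?>
]]>
</programlisting>
</informalexample>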
|
|
<para>
|
|
If the condition is the string <literal>(R)</literal>, it is satisfied if
|
|
a recursive call to the pattern or subpattern has been made. At "top
|
|
level", the condition is false.
|
|
</para>
|
|
<para>
|
|
If the condition is not a sequence of digits or (R), it must be an
|
|
assertion. This may be a positive or negative lookahead or
|
|
lookbehind assertion. Consider this pattern, again containing
|
|
non-significant white space, and with the two alternatives on
|
|
the second line:
|
|
</para>
|
|
|
|
<informalexample>
|
|
<programlisting>
|
|
<![CDATA[
|
|
(?(?=[^a-z]*[a-z])
|
|
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
|
]]>
|
|
</programlisting>
|
|
</informalexample>
|
|
<para>
|
|
The condition is a positive lookahead assertion that matches
|
|
an optional sequence of non-letters followed by a letter. In
|
|
other words, it tests for the presence of at least one
|
|
letter in the subject. If a letter is found, the subject is
|
|
matched against the first alternative; otherwise it is
|
|
matched against the second. This pattern matches strings in
|
|
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
|
|
letters and dd are digits.
|
|
</para>
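<para>
A sketch of the same conditional in use. As before, the anchors are an
addition made here so that the whole subject is tested, and the subject
strings are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$pattern = '/^(?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )$/x';

// A letter is present, so the first alternative is used.
$withLetters = preg_match($pattern, '12-abc-34');

// No letter is present, so the second alternative is used.
$digitsOnly = preg_match($pattern, '12-34-56');

// A letter is present, but the first alternative does not fit.
$tooShort = preg_match($pattern, '12-abc-3');

var_dump($withLetters, $digitsOnly, $tooShort); // int(1), int(1), int(0)
?>
]]>
</programlisting>
</informalexample>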
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.comments">
|
|
<title>Comments</title>
|
|
<para>
|
|
The sequence (?# marks the start of a comment which
|
|
continues up to the next closing parenthesis. Nested
|
|
parentheses are not permitted. The characters that make up a
|
|
comment play no part in the pattern matching at all.
|
|
</para>
|
|
<para>
|
|
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
|
|
option is set, an unescaped # character outside a character class
|
|
introduces a comment that continues up to the next newline character
|
|
in the pattern.
|
|
</para>
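<para>
Both comment styles can be sketched as follows; the comment texts and
variable names are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// An inline comment plays no part in the match.
$inline = preg_match('/ab(?#this text is ignored)c/', 'abc');

// Under the x modifier, "#" starts a comment that runs to the newline.
$extended = preg_match("/a b  # first two letters\n c/x", 'abc');

var_dump($inline, $extended); // int(1), int(1)
?>
]]>
</programlisting>
</informalexample>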
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.recursive">
|
|
<title>Recursive patterns</title>
|
|
<para>
|
|
Consider the problem of matching a string in parentheses,
|
|
allowing for unlimited nested parentheses. Without the use
|
|
of recursion, the best that can be done is to use a pattern
|
|
that matches up to some fixed depth of nesting. It is not
|
|
possible to handle an arbitrary nesting depth. Perl 5.6 has
|
|
provided an experimental facility that allows regular
|
|
expressions to recurse (among other things). The special
|
|
item (?R) is provided for the specific case of recursion.
|
|
This PCRE pattern solves the parentheses problem (assume
|
|
the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
|
|
option is set so that white space is
|
|
ignored):
|
|
|
|
<literal>\( ( (?>[^()]+) | (?R) )* \)</literal>
|
|
</para>
|
|
<para>
|
|
First it matches an opening parenthesis. Then it matches any
|
|
number of substrings which can either be a sequence of
|
|
non-parentheses, or a recursive match of the pattern itself
|
|
(i.e. a correctly parenthesized substring). Finally there is
|
|
a closing parenthesis.
|
|
</para>
|
|
<para>
|
|
This particular example pattern contains nested unlimited
|
|
repeats, and so the use of a once-only subpattern for matching
|
|
strings of non-parentheses is important when applying
|
|
the pattern to strings that do not match. For example, when
|
|
it is applied to
|
|
|
|
<literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>
|
|
|
|
it yields "no match" quickly. However, if a once-only subpattern
|
|
is not used, the match runs for a very long time
|
|
indeed because there are so many different ways the + and *
|
|
repeats can carve up the subject, and all have to be tested
|
|
before failure can be reported.
|
|
</para>
|
|
<para>
|
|
The values set for any capturing subpatterns are those from
|
|
the outermost level of the recursion at which the subpattern
|
|
value is set. If the pattern above is matched against
|
|
|
|
<literal>(ab(cd)ef)</literal>
|
|
|
|
the value for the capturing parentheses is "ef", which is
|
|
the last value taken on at the top level. If additional
|
|
parentheses are added, giving
|
|
|
|
<literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>
|
|
then the string they capture
|
|
is "ab(cd)ef", the contents of the top level parentheses. If
|
|
there are more than 15 capturing parentheses in a pattern,
|
|
PCRE has to obtain extra memory to store data during a
|
|
recursion, which it does by using pcre_malloc, freeing it
|
|
via pcre_free afterwards. If no memory can be obtained, it
|
|
saves data for the first 15 capturing parentheses only, as
|
|
there is no way to give an out-of-memory error from within a
|
|
recursion.
|
|
</para>
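<para>
The captured values described above can be inspected with a sketch such as
this one; the variable names are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// The x modifier makes the white space in the patterns non-significant.
$pattern = '/\( ( (?>[^()]+) | (?R) )* \)/x';
preg_match($pattern, '(ab(cd)ef)', $m);
var_dump($m[0]); // string(10) "(ab(cd)ef)"
var_dump($m[1]); // string(2) "ef"

// An extra pair of parentheses captures everything between the
// outermost brackets.
$wrapped = '/\( ( ( (?>[^()]+) | (?R) )* ) \)/x';
preg_match($wrapped, '(ab(cd)ef)', $w);
var_dump($w[1]); // string(8) "ab(cd)ef"
?>
]]>
</programlisting>
</informalexample>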
|
|
|
|
<para>
|
|
<literal>(?1)</literal>, <literal>(?2)</literal> and so on
|
|
can be used for recursive subpatterns too. It is also possible to use named
|
|
subpatterns: <literal>(?P&gt;name)</literal> or
<literal>(?&amp;name)</literal>.
|
|
</para>
|
|
<para>
|
|
If the syntax for a recursive subpattern reference (either by number or
|
|
by name) is used outside the parentheses to which it refers, it operates
|
|
like a subroutine in a programming language. An earlier example
|
|
pointed out that the pattern
|
|
<literal>(sens|respons)e and \1ibility</literal>
|
|
matches "sense and sensibility" and "response and responsibility", but
|
|
not "sense and responsibility". If instead the pattern
|
|
<literal>(sens|respons)e and (?1)ibility</literal>
|
|
is used, it does match "sense and responsibility" as well as the other
|
|
two strings. Such references must, however, follow the subpattern to
|
|
which they refer.
|
|
</para>
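<para>
The contrast between a back reference and a subroutine-style reference
can be sketched as follows; the variable names are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = 'sense and responsibility';

// \1 insists on the text that group 1 matched ("sens").
$backReference = preg_match('/(sens|respons)e and \1ibility/', $subject);

// (?1) re-runs the subpattern itself, so either alternative is allowed.
$subroutine = preg_match('/(sens|respons)e and (?1)ibility/', $subject);

var_dump($backReference, $subroutine); // int(0), int(1)
?>
]]>
</programlisting>
</informalexample>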
|
|
|
|
<para>
|
|
The maximum length of a subject string is the largest positive number
|
|
that an integer variable can hold. However, PCRE uses recursion to
|
|
handle subpatterns and indefinite repetition. This means that the
|
|
available stack space may limit the size of a subject string that can be
|
|
processed by certain patterns.
|
|
</para>
|
|
|
|
</section>
|
|
|
|
<section xml:id="regexp.reference.performance">
|
|
<title>Performance</title>
|
|
<para>
|
|
Certain items that may appear in patterns are more efficient
|
|
than others. It is more efficient to use a character class
|
|
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
|
|
In general, the simplest construction that provides the
|
|
required behaviour is usually the most efficient. Jeffrey
|
|
Friedl's book contains a lot of discussion about optimizing
|
|
regular expressions for efficient performance.
|
|
</para>
|
|
<para>
|
|
When a pattern begins with .* and the <link
|
|
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
|
|
set, the pattern is implicitly anchored by PCRE, since it
|
|
can match only at the start of a subject string. However, if
|
|
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
|
|
is not set, PCRE cannot make this optimization,
|
|
because the . metacharacter does not then match a newline,
|
|
and if the subject string contains newlines, the pattern may
|
|
match from the character immediately following one of them
|
|
instead of from the very start. For example, the pattern
|
|
|
|
<literal>(.*) second</literal>
|
|
|
|
matches the subject "first\nand second" (where \n stands for
|
|
a newline character) with the first captured substring being
|
|
"and". In order to do this, PCRE has to retry the match
|
|
starting after every newline in the subject.
|
|
</para>
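<para>
The effect of the option on this pattern can be sketched as follows,
using the <literal>s</literal> modifier for PCRE_DOTALL; the variable
names are illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = "first\nand second";

// Without DOTALL the dot stops at the newline, so the match has to
// start after it.
preg_match('/(.*) second/', $subject, $plain);
var_dump($plain[1] === 'and'); // bool(true)

// With DOTALL the dot crosses the newline and the match starts at
// the very beginning of the subject.
preg_match('/(.*) second/s', $subject, $dotall);
var_dump($dotall[1] === "first\nand"); // bool(true)
?>
]]>
</programlisting>
</informalexample>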
|
|
<para>
|
|
If you are using such a pattern with subject strings that do
|
|
not contain newlines, the best performance is obtained by
|
|
setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,
|
|
or starting the pattern with ^.* to
|
|
indicate explicit anchoring. That saves PCRE from having to
|
|
scan along the subject looking for a newline to restart at.
|
|
</para>
|
|
<para>
|
|
Beware of patterns that contain nested indefinite repeats.
|
|
These can take a long time to run when applied to a string
|
|
that does not match. Consider the pattern fragment
|
|
|
|
<literal>(a+)*</literal>
|
|
</para>
|
|
<para>
|
|
This can match "aaaa" in 16 different ways, and this number
|
|
increases very rapidly as the string gets longer. (The *
|
|
repeat can match 0, 1, 2, 3, or 4 times, and for each of
|
|
those cases other than 0, the + repeats can match different
|
|
numbers of times.) When the remainder of the pattern is such
|
|
that the entire match is going to fail, PCRE has in principle
|
|
to try every possible variation, and this can take an
|
|
extremely long time.
|
|
</para>
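<para>
The growth can be illustrated with a small counting sketch. It does not
call PCRE at all. It assumes a simple model in which a "way" is any
sequence of non-empty iterations that covers some prefix of a run of "a"
characters, including the empty sequence; other ways of counting give
other totals. The function name <literal>countWays</literal> is
hypothetical.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Counts the ways the fragment can match at the start of $n letters.
function countWays($n)
{
    // $ways[$k] is the number of ways to proceed with $k letters left.
    $ways = array(0 => 1);
    for ($k = 1; $k <= $n; $k++) {
        // Either stop here, or take 1..$k letters in one iteration and
        // continue with what is left.
        $ways[$k] = 1;
        for ($left = 0; $left < $k; $left++) {
            $ways[$k] += $ways[$left];
        }
    }
    return $ways[$n];
}

var_dump(countWays(4));  // int(16)
var_dump(countWays(20)); // int(1048576)
?>
]]>
</programlisting>
</informalexample>
<para>
Under this model the count doubles with every additional character, which
is consistent with failing matches becoming noticeably slow at around 20
characters.
</para>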
|
|
<para>
|
|
An optimization catches some of the simpler cases, such
as
|
|
|
|
<literal>(a+)*b</literal>
|
|
|
|
where a literal character follows. Before embarking on the
|
|
standard matching procedure, PCRE checks that there is a "b"
|
|
later in the subject string, and if there is not, it fails
|
|
the match immediately. However, when there is no following
|
|
literal this optimization cannot be used. You can see the
|
|
difference by comparing the behaviour of
|
|
|
|
<literal>(a+)*\d</literal>
|
|
|
|
with the pattern above. The pattern ending in "b" gives a failure almost
instantly when applied to a whole line of "a" characters,
whereas the pattern ending in \d takes an appreciable time with strings
longer than about 20 characters.
|
|
</para>
|
|
</section>
|
|
</chapter>
|
|
|
|
<!-- Keep this comment at the end of the file
|
|
Local variables:
|
|
mode: sgml
|
|
sgml-omittag:t
|
|
sgml-shorttag:t
|
|
sgml-minimize-attributes:nil
|
|
sgml-always-quote-attributes:t
|
|
sgml-indent-step:1
|
|
sgml-indent-data:t
|
|
indent-tabs-mode:nil
|
|
sgml-parent-document:nil
|
|
sgml-default-dtd-file:"~/.phpdoc/manual.ced"
|
|
sgml-exposed-tags:nil
|
|
sgml-local-catalogs:nil
|
|
sgml-local-ecat-files:nil
|
|
End:
|
|
vim600: syn=xml fen fdm=syntax fdl=2 si
|
|
vim: et tw=78 syn=sgml
|
|
vi: ts=1 sw=1
|
|
-->
|