mirror of
https://github.com/sigmasternchen/php-doc-en
synced 2025-03-15 16:38:54 +00:00

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@9864 c90b9560-bf6c-de11-be94-00142212c4b1
1558 lines
63 KiB
Text
<reference id="ref.pcre">
|
|
<title>Perl-compatible Regular Expression functions</title>
|
|
<titleabbrev>PCRE</titleabbrev>
|
|
|
|
<partintro>
|
|
<para>
|
|
The syntax for patterns used in these functions closely resembles
Perl. The expression must be enclosed in delimiters, for example a
forward slash (/). Any character can be used as the delimiter as
long as it is not alphanumeric or a backslash (\). If the delimiter
character has to be used in the expression itself, it needs to be
escaped with a backslash.
|
|
|
|
<para>
|
|
The ending delimiter may be followed by various options that
|
|
affect the matching.
|
|
See <link linkend="pcre.pattern.options">Pattern Options</link>.
|
|
|
|
<para>
|
|
<example>
|
|
<title>Examples of valid patterns</title>
|
|
<itemizedlist>
|
|
<listitem><simpara>/<\/\w+>/</listitem>
|
|
<listitem><simpara>|(\d{3})-\d+|Sm</listitem>
|
|
<listitem><simpara>/^(?i)php[34]/</listitem>
|
|
</itemizedlist>
|
|
</example>
|
|
|
|
<para>
|
|
<example>
|
|
<title>Examples of invalid patterns</title>
|
|
<itemizedlist>
|
|
<listitem><simpara>/href='(.*)' - missing ending delimiter</listitem>
|
|
<listitem><simpara>/\w+\s*\w+/J - unknown option 'J'</listitem>
|
|
<listitem><simpara>1-\d3-\d3-\d4| - missing starting delimiter
|
|
</listitem>
|
|
</itemizedlist>
|
|
</example>
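As a brief illustration of the delimiter rules, the following sketch
matches the same path with two different delimiters. The subject
string and variable names are invented for this example, and
preg_quote() with a delimiter argument is assumed to be available,
as it is in current PHP releases.

```php
<?php
$subject = '/usr/local/bin';

// With "/" as the delimiter, every slash inside the pattern must be escaped.
$withSlash = preg_match('/^\/usr\/(\w+)/', $subject, $a);

// Choosing "#" as the delimiter avoids the escaping entirely.
$withHash = preg_match('#^/usr/(\w+)#', $subject, $b);

// preg_quote() escapes the delimiter (and other special characters)
// when the text to match comes from a variable.
$dir = '/usr/local';
$quoted = preg_match('/^' . preg_quote($dir, '/') . '/', $subject);

print "$a[1] $b[1]\n";
```

Picking a delimiter that does not occur in the expression usually
keeps the pattern easier to read than escaping each occurrence.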
|
|
</partintro>
|
|
|
|
<refentry id="function.preg-match">
|
|
<refnamediv>
|
|
<refname>preg_match</refname>
|
|
<refpurpose>Perform a regular expression match</refpurpose>
|
|
</refnamediv>
|
|
<refsect1>
|
|
<title>Description</title>
|
|
<funcsynopsis>
|
|
<funcdef>int <function>preg_match</function></funcdef>
|
|
<paramdef>string <parameter>pattern</parameter></paramdef>
|
|
<paramdef>string <parameter>subject</parameter></paramdef>
|
|
<paramdef>array <parameter><optional>matches</optional></parameter></paramdef>
|
|
</funcsynopsis>
|
|
<para>
|
|
Searches <parameter>subject</parameter> for a match to the regular
|
|
expression given in <parameter>pattern</parameter>.
|
|
|
|
<para>
|
|
If <parameter>matches</parameter> is provided, then it is filled
with the results of the search. $matches[0] will contain the text that
matched the full pattern, $matches[1] will have the text that matched
the first captured parenthesized subpattern, and so on.
|
|
|
|
<para>
|
|
Returns true if a match for <parameter>pattern</parameter> was
found in the subject string, or false if no match was found
or an error occurred.
|
|
|
|
<para>
|
|
<example>
|
|
<title>Getting the page number out of a string</title>
|
|
<programlisting>
|
|
if (preg_match("/page\s+#(\d+)/i", "Go to page #9.", $parts))
|
|
print "Next page is $parts[1]";
|
|
else
|
|
print "Page not found.";
|
|
</programlisting>
|
|
</example>
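The layout of <parameter>matches</parameter> can be sketched with a
pattern that has two subpatterns. The subject string is made up for
illustration. Current PHP releases return the integers 1 and 0 rather
than true and false, so the result is only tested for truthiness here.

```php
<?php
$found = preg_match('/(\d{4})-(\d{2})/', 'Released 1999-05', $matches);

// $matches[0] is the whole match; $matches[1] and $matches[2] are the
// first and second parenthesized subpatterns.
if ($found) {
    print "full: $matches[0], year: $matches[1], month: $matches[2]\n";
}
```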
|
|
|
|
See also <function>preg_match_all</function>,
|
|
<function>preg_replace</function>, and
|
|
<function>preg_split</function>.
|
|
</refsect1>
|
|
</refentry>
|
|
|
|
<refentry id="function.preg-match-all">
|
|
<refnamediv>
|
|
<refname>preg_match_all</refname>
|
|
<refpurpose>Perform a global regular expression match</refpurpose>
|
|
</refnamediv>
|
|
<refsect1>
|
|
<title>Description</title>
|
|
<funcsynopsis>
|
|
<funcdef>int <function>preg_match_all</function></funcdef>
|
|
<paramdef>string <parameter>pattern</parameter></paramdef>
|
|
<paramdef>string <parameter>subject</parameter></paramdef>
|
|
<paramdef>array <parameter>matches</parameter></paramdef>
|
|
<paramdef>int <parameter><optional>order</optional></parameter></paramdef>
|
|
</funcsynopsis>
|
|
<para>
|
|
Searches <parameter>subject</parameter> for all matches to the regular
|
|
expression given in <parameter>pattern</parameter> and puts them in
|
|
<parameter>matches</parameter> in the order specified by
|
|
<parameter>order</parameter>.
|
|
|
|
<para>
|
|
After the first match is found, subsequent searches continue
from the end of the last match.
|
|
|
|
<para>
|
|
<parameter>order</parameter> can be one of two things:
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>PREG_PATTERN_ORDER</term>
|
|
<listitem>
|
|
<para>
|
|
Orders results so that $matches[0] is an array of full
|
|
pattern matches, $matches[1] is an array of strings matched by
|
|
the first parenthesized subpattern, and so on.
|
|
|
|
<informalexample>
|
|
<programlisting>
|
|
preg_match_all("|<[^>]+>(.*)</[^>]+>|U", "<b>example: </b><div align=left>this is a test</div>", $out, PREG_PATTERN_ORDER);
print $out[0][0].", ".$out[0][1]."\n";
print $out[1][0].", ".$out[1][1]."\n";
|
|
</programlisting>
|
|
</informalexample>
|
|
|
|
This example will produce:
|
|
<informalexample>
|
|
<programlisting>
|
|
<b>example: </b>, <div align=left>this is a test</div>
|
|
example: , this is a test
|
|
</programlisting>
|
|
</informalexample>
|
|
|
|
So, $out[0] contains an array of strings that matched the full pattern,
and $out[1] contains an array of strings enclosed by tags.
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>PREG_SET_ORDER</term>
|
|
<listitem>
|
|
<para>
|
|
Orders results so that $matches[0] is an array of the first set
of matches, $matches[1] is an array of the second set of matches,
and so on.
|
|
|
|
<informalexample>
|
|
<programlisting>
|
|
preg_match_all("|<[^>]+>(.*)</[^>]+>|U", "<b>example: </b><div align=left>this is a test</div>", $out, PREG_SET_ORDER);
print $out[0][0].", ".$out[0][1]."\n";
print $out[1][0].", ".$out[1][1]."\n";
|
|
</programlisting>
|
|
</informalexample>
|
|
|
|
This example will produce:
|
|
<informalexample>
|
|
<programlisting>
|
|
<b>example: </b>, example:
|
|
<div align=left>this is a test</div>, this is a test
|
|
</programlisting>
|
|
</informalexample>
|
|
|
|
In this case, $matches[0] is the first set of matches:
$matches[0][0] has the text matched by the full pattern, $matches[0][1]
has the text matched by the first subpattern, and so on. Similarly,
$matches[1] is the second set of matches, etc.
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
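The two orderings hold the same strings arranged differently, which
the following sketch shows by running one pattern both ways. The
subject string and the variable names are chosen only for this
illustration.

```php
<?php
$subject = 'a=1, b=2';
$pattern = '/(\w)=(\d)/';

preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);

// Pattern order: one array per subpattern, so index 1 holds what the
// first subpattern captured in every match.
print implode(' ', $byPattern[1]) . "\n";

// Set order: one array per match, so index 1 holds everything that
// belongs to the second match.
print implode(' ', $bySet[1]) . "\n";
```

Pattern order tends to suit code that needs one column of results,
while set order suits code that loops over the matches one by one.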
|
|
|
|
<para>
|
|
If <parameter>order</parameter> is not specified, it is assumed
|
|
to be PREG_PATTERN_ORDER.
|
|
|
|
<para>
|
|
Returns the number of full pattern matches, or false if
|
|
no match is found or an error occurred.
|
|
|
|
<para>
|
|
<example>
|
|
<title>Getting all phone numbers out of some text.</title>
|
|
<programlisting>
|
|
preg_match_all("/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x",
|
|
"Call 555-1212 or 1-800-555-1212", $phones);
|
|
</programlisting>
|
|
</example>
|
|
|
|
<simpara>
|
|
See also <function>preg_match</function>,
|
|
<function>preg_replace</function>,
|
|
and <function>preg_split</function>.
|
|
</refsect1>
|
|
</refentry>
|
|
|
|
<refentry id="function.preg-replace">
|
|
<refnamediv>
|
|
<refname>preg_replace</refname>
|
|
<refpurpose>Perform a regular expression search and replace</refpurpose>
|
|
</refnamediv>
|
|
<refsect1>
|
|
<title>Description</title>
|
|
<funcsynopsis>
|
|
<funcdef>mixed <function>preg_replace</function></funcdef>
|
|
<paramdef>mixed <parameter>pattern</parameter></paramdef>
|
|
<paramdef>mixed <parameter>replacement</parameter></paramdef>
|
|
<paramdef>mixed <parameter>subject</parameter></paramdef>
|
|
</funcsynopsis>
|
|
<para>
|
|
Searches <parameter>subject</parameter> for matches to
<parameter>pattern</parameter> and replaces them with
<parameter>replacement</parameter>.
|
|
|
|
<para>
|
|
<parameter>replacement</parameter> may contain references of the form
|
|
<literal>\\<replaceable>n</replaceable></literal>. Every such
|
|
reference will be replaced by the text captured by the
|
|
<replaceable>n</replaceable>'th parenthesized pattern.
<replaceable>n</replaceable> can be from 0 to 99, and <literal>\\0</literal> refers to
|
|
the text matched by the whole pattern. Opening parentheses are
|
|
counted from left to right (starting from 1) to obtain the number
|
|
of the capturing subpattern.
|
|
|
|
<para>
|
|
If no matches are found in <parameter>subject</parameter>, then
|
|
it will be returned unchanged.
|
|
|
|
<para>
|
|
Every parameter to <function>preg_replace</function> can be an array.
|
|
|
|
<para>
|
|
If <parameter>subject</parameter> is an array, then the search and
|
|
replace is performed on every entry of <parameter>subject</parameter>,
|
|
and the return value is an array as well.
|
|
|
|
<para>
|
|
If <parameter>pattern</parameter> and <parameter>replacement</parameter>
|
|
are arrays, then <function>preg_replace</function> takes a value from
|
|
each array and uses them to do search and replace on
|
|
<parameter>subject</parameter>. If <parameter>replacement</parameter>
has fewer values than <parameter>pattern</parameter>, then an empty string
is used for the remaining replacement values. If
<parameter>pattern</parameter> is an array and
<parameter>replacement</parameter> is a string, then this replacement
string is used for every value of <parameter>pattern</parameter>. The
converse would not make sense, though.
|
|
|
|
<para>
|
|
<example>
|
|
<title>Replacing several values</title>
|
|
<programlisting>
|
|
$patterns = array("/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/", "/^\s*{(\w+)}\s*=/");
$replace = array("\\3/\\4/\\1\\2", "$\\1 =");
|
|
print preg_replace($patterns, $replace, "{startDate} = 1999-5-27");
|
|
</programlisting>
|
|
</example>
|
|
|
|
This example will produce:
|
|
|
|
<programlisting>
|
|
$startDate = 5/27/1999
|
|
</programlisting>
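The reference syntax and the array rules described above can be
sketched as follows. All subject strings and variable names are
invented for this illustration.

```php
<?php
// \\0 is the whole match, \\1 and \\2 are the captured subpatterns.
$swapped = preg_replace('/(\w+)@(\w+)\.com/', '\\2: \\1 (\\0)', 'joe@example.com');

// One replacement string shared by every pattern in the array.
$shared = preg_replace(array('/a/', '/b/'), 'x', 'abc');

// Fewer replacements than patterns: the missing ones act as empty strings.
$short = preg_replace(array('/a/', '/b/'), array('1'), 'abc');

// An array subject yields an array result.
$many = preg_replace('/\d/', '#', array('a1', 'b2'));

print "$swapped\n$shared\n$short\n";
```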
|
|
|
|
See also <function>preg_match</function>,
|
|
<function>preg_match_all</function>, and
|
|
<function>preg_split</function>.
|
|
</refsect1>
|
|
</refentry>
|
|
|
|
<refentry id="function.preg-split">
|
|
<refnamediv>
|
|
<refname>preg_split</refname>
|
|
<refpurpose>Split string by a regular expression</refpurpose>
|
|
</refnamediv>
|
|
<refsect1>
|
|
<title>Description</title>
|
|
<funcsynopsis>
|
|
<funcdef>array <function>preg_split</function></funcdef>
|
|
<paramdef>string <parameter>pattern</parameter></paramdef>
|
|
<paramdef>string <parameter>subject</parameter></paramdef>
|
|
<paramdef>int <parameter><optional>limit</optional></parameter></paramdef>
|
|
</funcsynopsis>
|
|
<para>
|
|
Returns an array containing substrings of <parameter>subject</parameter>
|
|
split along boundaries matched by <parameter>pattern</parameter>.
|
|
|
|
<para>
|
|
If <parameter>limit</parameter> is specified, then only substrings
|
|
up to <parameter>limit</parameter> are returned.
|
|
|
|
<para>
|
|
<example>
|
|
<title>Getting parts of search string</title>
|
|
<programlisting>
|
|
$keywords = preg_split("/[\s,]+/", "hypertext language, programming");
|
|
</programlisting>
|
|
</example>
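A short sketch of the result, with and without
<parameter>limit</parameter>, reusing the subject from the example.
The behaviour shown for the limited call is that of current PHP
releases, where the last element holds the unsplit remainder of the
string; this is an assumption about the reader's PHP version.

```php
<?php
$all = preg_split("/[\s,]+/", "hypertext language, programming");
$two = preg_split("/[\s,]+/", "hypertext language, programming", 2);

print implode('|', $all) . "\n";
print implode('|', $two) . "\n";
```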
|
|
|
|
See also <function>preg_match</function>,
|
|
<function>preg_match_all</function>, and
|
|
<function>preg_replace</function>.
|
|
</refsect1>
|
|
</refentry>
|
|
|
|
<refentry id="pcre.pattern.options">
|
|
<refnamediv>
|
|
<refname>Pattern Options</refname>
|
|
<refpurpose>describes possible options in regex
|
|
patterns</refpurpose>
|
|
</refnamediv>
|
|
<refsect1>
|
|
<title>Description</title>
|
|
<para>
|
|
The current possible PCRE options are listed below. The names in
|
|
parentheses refer to internal PCRE names for these options.
|
|
|
|
<para>
|
|
<blockquote>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><emphasis>i</emphasis> (PCRE_CASELESS)</term>
|
|
<listitem>
|
|
<simpara>
|
|
If this option is set, letters in the pattern match both
|
|
upper and lower case letters.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><emphasis>m</emphasis> (PCRE_MULTILINE)</term>
|
|
<listitem>
|
|
<simpara>
|
|
By default, PCRE treats the subject string as consisting of a
|
|
single "line" of characters (even if it actually contains
|
|
several newlines). The "start of line" metacharacter (^)
|
|
matches only at the start of the string, while the "end of
|
|
line" metacharacter ($) matches only at the end of the
|
|
string, or before a terminating newline (unless the
<emphasis>E</emphasis> option is set). This is the same as
|
|
Perl.
|
|
|
|
<simpara>
|
|
When this option is set, the "start of line" and "end of
|
|
line" constructs match immediately following or immediately
|
|
before any newline in the subject string, respectively, as
|
|
well as at the very start and end. This is equivalent to
|
|
Perl's /m option. If there are no "\n" characters in a
|
|
subject string, or no occurrences of ^ or $ in a pattern,
|
|
setting this option has no effect.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><emphasis>s</emphasis> (PCRE_DOTALL)</term>
|
|
<listitem>
|
|
<simpara>
|
|
If this option is set, a dot metacharacter in the pattern
|
|
matches all characters, including newlines. Without it,
|
|
newlines are excluded. This option is equivalent to Perl's
|
|
/s option. A negative class such as [^a] always matches a
|
|
newline character, independent of the setting of this
|
|
option.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><emphasis>x</emphasis> (PCRE_EXTENDED)</term>
|
|
<listitem>
|
|
<simpara>
|
|
If this option is set, whitespace data characters in the
|
|
pattern are totally ignored except when escaped or inside a
|
|
character class, and characters between an unescaped #
|
|
outside a character class and the next newline character,
|
|
inclusive, are also ignored. This is equivalent to Perl's /x
|
|
option, and makes it possible to include comments inside
|
|
complicated patterns. Note, however, that this applies only
|
|
to data characters. Whitespace characters may never appear
|
|
within special character sequences in a pattern, for example
|
|
within the sequence (?( which introduces a conditional
|
|
subpattern.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><emphasis>A</emphasis> (PCRE_ANCHORED)</term>
|
|
<listitem>
|
|
<simpara>
|
|
If this option is set, the pattern is forced to be
|
|
"anchored", that is, it is constrained to match only at the
|
|
start of the string which is being searched (the "subject
|
|
string"). This effect can also be achieved by appropriate
|
|
constructs in the pattern itself, which is the only way to
|
|
do it in Perl.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><emphasis>E</emphasis> (PCRE_DOLLAR_ENDONLY)</term>
|
|
<listitem>
|
|
<simpara>
|
|
If this option is set, a dollar metacharacter in the pattern
|
|
matches only at the end of the subject string. Without this
|
|
option, a dollar also matches immediately before the final
|
|
character if it is a newline (but not before any other
|
|
newlines). This option is ignored if the <emphasis>m</emphasis>
|
|
option is set. There is no equivalent to this option in
|
|
Perl.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><emphasis>S</emphasis></term>
|
|
<listitem>
|
|
<simpara>
|
|
When a pattern is going to be used several times, it is
|
|
worth spending more time analyzing it in order to speed up
|
|
the time taken for matching. If this option is set, then
|
|
this extra analysis is performed. At present, studying a
|
|
pattern is useful only for non-anchored patterns that do not
|
|
have a single fixed starting character.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><emphasis>U</emphasis> (PCRE_UNGREEDY)</term>
|
|
<listitem>
|
|
<simpara>
|
|
This option inverts the "greediness" of the quantifiers so
|
|
that they are not greedy by default, but become greedy if
|
|
followed by "?". It is not compatible with Perl. It can also
|
|
be set by a (?U) option setting within the pattern.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><emphasis>X</emphasis> (PCRE_EXTRA)</term>
|
|
<listitem>
|
|
<simpara>
|
|
This option turns on additional functionality of PCRE that
|
|
is incompatible with Perl. Any backslash in a pattern that
|
|
is followed by a letter that has no special meaning causes
|
|
an error, thus reserving these combinations for future
|
|
expansion. By default, as in Perl, a backslash followed by a
|
|
letter with no special meaning is treated as a literal.
|
|
There are at present no other features controlled by this
|
|
option.
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</blockquote>
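The effect of several of these options can be sketched by running
patterns with and without the option letter. The subject strings and
variable names are invented for this illustration, and only option
letters that current PHP releases still accept under the same name
(i, m, s, x, A, U) are used.

```php
<?php
$text = "First line\nsecond LINE";

$caseless = preg_match('/second line/i', $text);   // i: letters match either case
$noMulti  = preg_match('/^second/', $text);        // ^ matches only at the start
$multi    = preg_match('/^second/m', $text);       // m: ^ also matches after a newline
$noDotall = preg_match('/First.*LINE/', $text);    // dot stops at the newline
$dotall   = preg_match('/First.*LINE/s', $text);   // s: dot matches the newline too
$extended = preg_match('/ \d+ \s* km  # distance /x', 'about 42 km');
$anchored = preg_match('/line/A', $text);          // A: must match at the very start

preg_match('/<.+>/', '<b>bold</b>', $greedy);      // default: as much as possible
preg_match('/<.+>/U', '<b>bold</b>', $lazy);       // U: as little as possible

print "$greedy[0] $lazy[0]\n";
```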
|
|
</refsect1>
|
|
</refentry>
|
|
|
|
<refentry id="pcre.pattern.syntax">
|
|
<refnamediv>
|
|
<refname>Pattern Syntax</refname>
|
|
<refpurpose>describes PCRE regex syntax</refpurpose>
|
|
</refnamediv>
|
|
<refsect1>
|
|
<title>Description</title>
|
|
<literallayout>
|
|
The PCRE library is a set of functions that implement regular
|
|
expression pattern matching using the same syntax and semantics
|
|
as Perl 5, with just a few differences (see below). The current
|
|
implementation corresponds to Perl 5.005.
|
|
</literallayout>
|
|
|
|
<refsect1>
|
|
<title>Differences From Perl</title>
|
|
<literallayout>
|
|
The differences described here are with respect to Perl
|
|
5.005.
|
|
|
|
1. By default, a whitespace character is any character that
|
|
the C library function isspace() recognizes, though it is
|
|
possible to compile PCRE with alternative character type
|
|
tables. Normally isspace() matches space, formfeed, newline,
|
|
carriage return, horizontal tab, and vertical tab. Perl 5 no
|
|
longer includes vertical tab in its set of whitespace char-
|
|
acters. The \v escape that was in the Perl documentation for
|
|
a long time was never in fact recognized. However, the char-
|
|
acter itself was treated as whitespace at least up to 5.002.
|
|
In 5.004 and 5.005 it does not match \s.
|
|
|
|
2. PCRE does not allow repeat quantifiers on lookahead
|
|
assertions. Perl permits them, but they do not mean what you
|
|
might think. For example, (?!a){3} does not assert that the
|
|
next three characters are not "a". It just asserts that the
|
|
next character is not "a" three times.
|
|
|
|
3. Capturing subpatterns that occur inside negative looka-
|
|
head assertions are counted, but their entries in the
|
|
offsets vector are never set. Perl sets its numerical vari-
|
|
ables from any such patterns that are matched before the
|
|
assertion fails to match something (thereby succeeding), but
|
|
only if the negative lookahead assertion contains just one
|
|
branch.
|
|
|
|
4. Though binary zero characters are supported in the sub-
|
|
ject string, they are not allowed in a pattern string
|
|
because it is passed as a normal C string, terminated by
|
|
zero. The escape sequence "\0" can be used in the pattern to
|
|
represent a binary zero.
|
|
|
|
5. The following Perl escape sequences are not supported:
|
|
\l, \u, \L, \U, \E, \Q. In fact these are implemented by
|
|
Perl's general string-handling and are not part of its pat-
|
|
tern matching engine.
|
|
|
|
6. The Perl \G assertion is not supported as it is not
|
|
relevant to single pattern matches.
|
|
|
|
7. Fairly obviously, PCRE does not support the (?{code})
|
|
construction.
|
|
|
|
8. There are at the time of writing some oddities in Perl
|
|
5.005_02 concerned with the settings of captured strings
|
|
when part of a pattern is repeated. For example, matching
|
|
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
|
|
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
|
|
unset. However, if the pattern is changed to
|
|
/^(aa(b(b))?)+$/ then $2 (and $3) get set.
|
|
|
|
In Perl 5.004 $2 is set in both cases, and that is also true
|
|
of PCRE. If in the future Perl changes to a consistent state
|
|
that is different, PCRE may change to follow.
|
|
|
|
9. Another as yet unresolved discrepancy is that in Perl
|
|
5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
|
|
"a", whereas in PCRE it does not. However, in both Perl and
|
|
PCRE /^(a)?a/ matched against "a" leaves $1 unset.
|
|
|
|
10. PCRE provides some extensions to the Perl regular
|
|
expression facilities:
|
|
|
|
(a) Although lookbehind assertions must match fixed length
|
|
strings, each alternative branch of a lookbehind assertion
|
|
can match a different length of string. Perl 5.005 requires
|
|
them all to have the same length.
|
|
|
|
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
|
|
set, the $ meta-character matches only at the very end of
|
|
the string.
|
|
|
|
(c) If PCRE_EXTRA is set, a backslash followed by a letter
|
|
with no special meaning is faulted.
|
|
|
|
(d) If PCRE_UNGREEDY is set, the greediness of the repeti-
|
|
tion quantifiers is inverted, that is, by default they are
|
|
not greedy, but if followed by a question mark they are.
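
Extension (a) can be sketched with a lookbehind whose two
branches have different lengths. The pattern and subject
strings are invented for this illustration.

```php
<?php
// The lookbehind branches "a" and "bc" are 1 and 2 characters long.
$pattern = '/(?<=a|bc)d/';

$afterA  = preg_match($pattern, 'ad');    // "d" preceded by "a"
$afterBc = preg_match($pattern, 'xbcd');  // "d" preceded by "bc"
$neither = preg_match($pattern, 'xd');    // neither branch fits

print "$afterA $afterBc $neither\n";
```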
|
|
</literallayout>
|
|
</refsect1>
|
|
|
|
<refsect1>
|
|
<title>Regular Expression Details</title>
|
|
<literallayout>
|
|
The syntax and semantics of the regular expressions sup-
|
|
ported by PCRE are described below. Regular expressions are
|
|
also described in the Perl documentation and in a number of
|
|
other books, some of which have copious examples. Jeffrey
|
|
Friedl's "Mastering Regular Expressions", published by
|
|
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
|
|
The description here is intended as reference documentation.
|
|
|
|
A regular expression is a pattern that is matched against a
|
|
subject string from left to right. Most characters stand for
|
|
themselves in a pattern, and match the corresponding charac-
|
|
ters in the subject. As a trivial example, the pattern
|
|
|
|
The quick brown fox
|
|
|
|
matches a portion of a subject string that is identical to
|
|
itself. The power of regular expressions comes from the
|
|
ability to include alternatives and repetitions in the pat-
|
|
tern. These are encoded in the pattern by the use of <emphasis>meta</emphasis>-
|
|
<emphasis>characters</emphasis>, which do not stand for themselves but instead
|
|
are interpreted in some special way.
|
|
|
|
There are two different sets of meta-characters: those that
|
|
are recognized anywhere in the pattern except within square
|
|
brackets, and those that are recognized in square brackets.
|
|
Outside square brackets, the meta-characters are as follows:
|
|
|
|
  \      general escape character with several uses
  ^      assert start of subject (or line, in multiline mode)
  $      assert end of subject (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
  {      start min/max quantifier
|
|
|
|
Part of a pattern that is in square brackets is called a
|
|
"character class". In a character class the only meta-
|
|
characters are:
|
|
|
|
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  ]      terminates the character class
|
|
|
|
The following sections describe the use of each of the
|
|
meta-characters.
|
|
|
|
BACKSLASH
|
|
The backslash character has several uses. Firstly, if it is
|
|
followed by a non-alphameric character, it takes away any
|
|
special meaning that character may have. This use of
|
|
backslash as an escape character applies both inside and
|
|
outside character classes.
|
|
|
|
For example, if you want to match a "*" character, you write
|
|
"\*" in the pattern. This applies whether or not the follow-
|
|
ing character would otherwise be interpreted as a meta-
|
|
character, so it is always safe to precede a non-alphameric
|
|
with "\" to specify that it stands for itself. In particu-
|
|
lar, if you want to match a backslash, you write "\\".
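
A small sketch of the difference an escaping backslash makes.
The subject strings are invented for this illustration. Note
that PHP string syntax adds a layer of its own: the pattern
"\\" has to be written as '\\\\' in PHP source.

```php
<?php
preg_match('/2\*3/', 'area = 2*3', $escaped);  // \* is a literal asterisk
preg_match('/2*3/', 'area = 2*3', $plain);     // * is a quantifier on "2"

// '/\\\\/' in PHP source is the pattern /\\/, which matches one backslash.
$backslash = preg_match('/\\\\/', 'C:\\temp');

print "$escaped[0] | $plain[0]\n";
```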
|
|
|
|
If a pattern is compiled with the PCRE_EXTENDED option, whi-
|
|
tespace in the pattern (other than in a character class) and
|
|
characters between a "#" outside a character class and the
|
|
next newline character are ignored. An escaping backslash
|
|
can be used to include a whitespace or "#" character as part
|
|
of the pattern.
|
|
|
|
A second use of backslash provides a way of encoding non-
|
|
printing characters in patterns in a visible manner. There
|
|
is no restriction on the appearance of non-printing charac-
|
|
ters, apart from the binary zero that terminates a pattern,
|
|
but when a pattern is being prepared by text editing, it is
|
|
usually easier to use one of the following escape sequences
|
|
than the binary character it represents:
|
|
|
|
  \a     alarm, that is, the BEL character (hex 07)
  \cx    "control-x", where x is any character
  \e     escape (hex 1B)
  \f     formfeed (hex 0C)
  \n     newline (hex 0A)
  \r     carriage return (hex 0D)
  \t     tab (hex 09)
  \xhh   character with hex code hh
  \ddd   character with octal code ddd, or backreference
|
|
|
|
The precise effect of "\cx" is as follows: if "x" is a lower
|
|
case letter, it is converted to upper case. Then bit 6 of
|
|
the character (hex 40) is inverted. Thus "\cz" becomes hex
|
|
1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
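
The arithmetic behind "\cx" can be reproduced in a few lines.
The helper name controlCode() is hypothetical and exists only
for this sketch; it mirrors the rule stated above rather than
calling into PCRE.

```php
<?php
// Hypothetical helper: convert to upper case, then invert bit 6 (hex 40).
function controlCode($x)
{
    return ord(strtoupper($x)) ^ 0x40;
}

printf("%02X %02X %02X\n", controlCode('z'), controlCode('{'), controlCode(';'));
```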
|
|
|
|
After "\x", up to two hexadecimal digits are read (letters
|
|
can be in upper or lower case).
|
|
|
|
After "\0" up to two further octal digits are read. In both
|
|
cases, if there are fewer than two digits, just those that
|
|
are present are used. Thus the sequence "\0\x\07" specifies
|
|
two binary zeros followed by a BEL character. Make sure you
|
|
supply two digits after the initial zero if the character
|
|
that follows is itself an octal digit.
|
|
|
|
The handling of a backslash followed by a digit other than 0
|
|
is complicated. Outside a character class, PCRE reads it
|
|
and any following digits as a decimal number. If the number
|
|
is less than 10, or if there have been at least that many
|
|
previous capturing left parentheses in the expression, the
|
|
entire sequence is taken as a <emphasis>back</emphasis> <emphasis>reference</emphasis>. A description
|
|
of how this works is given later, following the discussion
|
|
of parenthesized subpatterns.
|
|
|
|
Inside a character class, or if the decimal number is
|
|
greater than 9 and there have not been that many capturing
|
|
subpatterns, PCRE re-reads up to three octal digits follow-
|
|
ing the backslash, and generates a single byte from the
|
|
least significant 8 bits of the value. Any subsequent digits
|
|
stand for themselves. For example:
|
|
|
|
  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
         previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
         writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   is the character with octal code 113 (since there
         can be no more than 99 back references)
  \377   is a byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
         followed by the two characters "8" and "1"
|
|
|
|
Note that octal values of 100 or greater must not be intro-
|
|
duced by a leading zero, because no more than three octal
|
|
digits are ever read.
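
A few of the unambiguous cases from the list above, as a
sketch. The subject strings are invented for this
illustration; single-quoted PHP strings are used so that the
backslash sequences reach PCRE unchanged.

```php
<?php
$space = preg_match('/a\040b/', 'a b');       // \040 is octal for a space
$tab   = preg_match('/a\011b/', "a\tb");      // \011 is always a tab
$tab3  = preg_match('/^\0113$/', "\t3");      // a tab followed by the character "3"
$back  = preg_match('/(a)(b)\2\1/', 'abba');  // \2 and \1 are back references

print "$space $tab $tab3 $back\n";
```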
|
|
|
|
All the sequences that define a single byte value can be
|
|
used both inside and outside character classes. In addition,
|
|
inside a character class, the sequence "\b" is interpreted
|
|
as the backspace character (hex 08). Outside a character
|
|
class it has a different meaning (see below).
|
|
|
|
The third use of backslash is for specifying generic charac-
|
|
ter types:
|
|
|
|
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character
|
|
|
|
Each pair of escape sequences partitions the complete set of
|
|
characters into two disjoint sets. Any given character
|
|
matches one, and only one, of each pair.
|
|
|
|
A "word" character is any letter or digit or the underscore
|
|
character, that is, any character which can be part of a
|
|
Perl "word". The definition of letters and digits is con-
|
|
trolled by PCRE's character tables, and may vary if locale-
|
|
specific matching is taking place (see "Locale support"
|
|
above). For example, in the "fr" (French) locale, some char-
|
|
acter codes greater than 128 are used for accented letters,
|
|
and these are matched by \w.
|
|
|
|
These character type sequences can appear both inside and
|
|
outside character classes. They each match one character of
|
|
the appropriate type. If the current matching point is at
|
|
the end of the subject string, all of them fail, since there
|
|
is no character to match.
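
The character types can be tried out with the functions
documented earlier. The subject strings are invented for
this sketch, and the default (non-locale) character tables
are assumed.

```php
<?php
preg_match_all('/\w+/', 'foo_1 bar-2', $words);          // runs of word characters
$digitsOnly = preg_replace('/\D/', '', '(555) 12-34');   // drop every non-digit
$fields     = preg_split('/\s+/', "one \t two\nthree");  // split on whitespace

// \d needs a character to match, so it fails at the end of the subject.
$atEnd = preg_match('/abc\d/', 'abc');

print implode(',', $words[0]) . " $digitsOnly\n";
```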
|
|
|
|
The fourth use of backslash is for certain simple asser-
|
|
tions. An assertion specifies a condition that has to be met
|
|
at a particular point in a match, without consuming any
|
|
characters from the subject string. The use of subpatterns
|
|
for more complicated assertions is described below. The
|
|
backslashed assertions are
|
|
|
|
  \b     word boundary
  \B     not a word boundary
  \A     start of subject (independent of multiline mode)
  \Z     end of subject or newline at end (independent of
         multiline mode)
  \z     end of subject (independent of multiline mode)
|
|
|
|
These assertions may not appear in character classes (but
|
|
note that "\b" has a different meaning, namely the backspace
|
|
character, inside a character class).
|
|
|
|
A word boundary is a position in the subject string where
|
|
the current character and the previous character do not both
|
|
match \w or \W (i.e. one matches \w and the other matches
|
|
\W), or the start or end of the string if the first or last
|
|
character matches \w, respectively.
|
|
|
|
The \A, \Z, and \z assertions differ from the traditional
|
|
circumflex and dollar (described below) in that they only
|
|
ever match at the very start and end of the subject string,
|
|
whatever options are set. They are not affected by the
|
|
PCRE_NOTBOL or PCRE_NOTEOL options. The difference between
|
|
\Z and \z is that \Z matches before a newline that is the
|
|
last character of the string as well as at the end of the
|
|
string, whereas \z matches only at the end.
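
The backslashed assertions can be sketched as follows. The
subject strings are invented for this illustration.

```php
<?php
$word    = preg_match('/\bcat\b/', 'a cat sat');     // "cat" as a whole word
$inside  = preg_match('/\bcat\b/', 'concatenate');   // no boundary around "cat"
$notEdge = preg_match('/\Bcat\B/', 'concatenate');   // "cat" strictly inside a word

$bigZ   = preg_match('/end\Z/', "the end\n");        // \Z allows a final newline
$smallZ = preg_match('/end\z/', "the end\n");        // \z does not

// \A ignores multiline mode, so it does not match after the newline.
$startA = preg_match('/\Asecond/m', "first\nsecond");

print "$word $inside $notEdge $bigZ $smallZ $startA\n";
```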
|
|
|
|
CIRCUMFLEX AND DOLLAR
|
|
Outside a character class, in the default matching mode, the
|
|
circumflex character is an assertion which is true only if
|
|
the current matching point is at the start of the subject
|
|
string. Inside a character class, circumflex has an entirely
|
|
different meaning (see below).
|
|
|
|
Circumflex need not be the first character of the pattern if
|
|
a number of alternatives are involved, but it should be the
|
|
first thing in each alternative in which it appears if the
|
|
pattern is ever to match that branch. If all possible alter-
|
|
natives start with a circumflex, that is, if the pattern is
|
|
constrained to match only at the start of the subject, it is
|
|
said to be an "anchored" pattern. (There are also other con-
|
|
structs that can cause a pattern to be anchored.)
|
|
|
|
A dollar character is an assertion which is true only if the
|
|
current matching point is at the end of the subject string,
|
|
or immediately before a newline character that is the last
|
|
character in the string (by default). Dollar need not be the
|
|
last character of the pattern if a number of alternatives
|
|
are involved, but it should be the last item in any branch
|
|
in which it appears. Dollar has no special meaning in a
|
|
character class.
|
|
|
|
The meaning of dollar can be changed so that it matches only
|
|
at the very end of the string, by setting the
|
|
PCRE_DOLLAR_ENDONLY option at compile or matching time. This
|
|
does not affect the \Z assertion.
|
|
|
|
The meanings of the circumflex and dollar characters are
|
|
changed if the PCRE_MULTILINE option is set. When this is
|
|
the case, they match immediately after and immediately
|
|
before an internal "\n" character, respectively, in addition
|
|
to matching at the start and end of the subject string. For
|
|
example, the pattern /^abc$/ matches the subject string
|
|
"def\nabc" in multiline mode, but not otherwise. Conse-
|
|
quently, patterns that are anchored in single line mode
|
|
because all branches start with "^" are not anchored in mul-
|
|
tiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if
|
|
PCRE_MULTILINE is set.
|
|
|
|
Note that the sequences \A, \Z, and \z can be used to match
|
|
the start and end of the subject in both modes, and if all
|
|
branches of a pattern start with \A it is always anchored,
|
|
whether PCRE_MULTILINE is set or not.
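
The anchoring rules above can be sketched in PHP as follows.
This is only an illustration: the subject strings are made
up, and it assumes the PHP pattern modifiers m
(PCRE_MULTILINE) and D (PCRE_DOLLAR_ENDONLY).

```php
<?php
// Subject strings are illustrative only.
$twoLines = "def\nabc";
$trailingNewline = "abc\n";

// Default mode: ^ matches only at the very start of the subject.
$default   = preg_match('/^abc$/', $twoLines);            // 0
// m (PCRE_MULTILINE): ^ also matches right after an internal newline.
$multiline = preg_match('/^abc$/m', $twoLines);           // 1

// $ also matches before a final newline, unless D is set.
$dollar        = preg_match('/abc$/', $trailingNewline);  // 1
$dollarEndOnly = preg_match('/abc$/D', $trailingNewline); // 0

// \Z behaves like the default $; \z matches only at the very end.
$bigZ   = preg_match('/abc\Z/', $trailingNewline);        // 1
$smallZ = preg_match('/abc\z/', $trailingNewline);        // 0

var_dump($default, $multiline, $dollar, $dollarEndOnly, $bigZ, $smallZ);
```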
|
|
|
|
|
|
|
|
FULL STOP (PERIOD, DOT)
|
|
Outside a character class, a dot in the pattern matches any
|
|
one character in the subject, including a non-printing
|
|
character, but not (by default) newline. If the PCRE_DOTALL
|
|
option is set, then dots match newlines as well. The han-
|
|
dling of dot is entirely independent of the handling of cir-
|
|
cumflex and dollar, the only relationship being that they
|
|
both involve newline characters. Dot has no special meaning
|
|
in a character class.
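
A short PHP sketch of the dot behaviour described above,
assuming the s pattern modifier stands for PCRE_DOTALL. The
subject strings are made up for the example.

```php
<?php
$subject = "a\nc";

// By default the dot does not match a newline.
$plainDot = preg_match('/a.c/', $subject);   // 0
// With s (PCRE_DOTALL) the dot matches a newline as well.
$dotAll   = preg_match('/a.c/s', $subject);  // 1

// Inside a character class the dot is an ordinary character.
$literalDot = preg_match('/^[.]$/', 'x');    // 0
$realDot    = preg_match('/^[.]$/', '.');    // 1

var_dump($plainDot, $dotAll, $literalDot, $realDot);
```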
|
|
|
|
|
|
|
|
SQUARE BRACKETS
|
|
An opening square bracket introduces a character class, ter-
|
|
minated by a closing square bracket. A closing square
|
|
bracket on its own is not special. If a closing square
|
|
bracket is required as a member of the class, it should be
|
|
the first data character in the class (after an initial cir-
|
|
cumflex, if present) or escaped with a backslash.
|
|
|
|
A character class matches a single character in the subject;
|
|
the character must be in the set of characters defined by
|
|
the class, unless the first character in the class is a cir-
|
|
cumflex, in which case the subject character must not be in
|
|
the set defined by the class. If a circumflex is actually
|
|
required as a member of the class, ensure it is not the
|
|
first character, or escape it with a backslash.
|
|
|
|
For example, the character class [aeiou] matches any lower
|
|
case vowel, while [^aeiou] matches any character that is not
|
|
a lower case vowel. Note that a circumflex is just a con-
|
|
venient notation for specifying the characters which are in
|
|
the class by enumerating those that are not. It is not an
|
|
assertion: it still consumes a character from the subject
|
|
string, and fails if the current pointer is at the end of
|
|
the string.
|
|
|
|
When caseless matching is set, any letters in a class
|
|
represent both their upper case and lower case versions, so
|
|
for example, a caseless [aeiou] matches "A" as well as "a",
|
|
and a caseless [^aeiou] does not match "A", whereas a case-
|
|
ful version would.
|
|
|
|
The newline character is never treated in any special way in
|
|
character classes, whatever the setting of the PCRE_DOTALL
|
|
or PCRE_MULTILINE options is. A class such as [^a] will
|
|
always match a newline.
|
|
|
|
The minus (hyphen) character can be used to specify a range
|
|
of characters in a character class. For example, [d-m]
|
|
matches any letter between d and m, inclusive. If a minus
|
|
character is required in a class, it must be escaped with a
|
|
backslash or appear in a position where it cannot be inter-
|
|
preted as indicating a range, typically as the first or last
|
|
character in the class.
|
|
It is not possible to have the literal character "]" as the
|
|
end character of a range. A pattern such as [W-]46] is
|
|
interpreted as a class of two characters ("W" and "-") fol-
|
|
lowed by a literal string "46]", so it would match "W46]" or
|
|
"-46]". However, if the "]" is escaped with a backslash it
|
|
is interpreted as the end of range, so [W-\]46] is inter-
|
|
preted as a single class containing a range followed by two
|
|
separate characters. The octal or hexadecimal representation
|
|
of "]" can also be used to end a range.
|
|
|
|
Ranges operate in ASCII collating sequence. They can also be
|
|
used for characters specified numerically, for example
|
|
[\000-\037]. If a range that includes letters is used when
|
|
caseless matching is set, it matches the letters in either
|
|
case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
|
|
matched caselessly, and if character tables for the "fr"
|
|
locale are in use, [\xc8-\xcb] matches accented E characters
|
|
in both cases.
|
|
|
|
The character types \d, \D, \s, \S, \w, and \W may also
|
|
appear in a character class, and add the characters that
|
|
they match to the class. For example, [\dABCDEF] matches any
|
|
hexadecimal digit. A circumflex can conveniently be used
|
|
with the upper case character types to specify a more res-
|
|
tricted set of characters than the matching lower case type.
|
|
For example, the class [^\W_] matches any letter or digit,
|
|
but not underscore.
|
|
|
|
All non-alphanumeric characters other than \, -, ^ (at the
|
|
start) and the terminating ] are non-special in character
|
|
classes, but it does no harm if they are escaped.
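
The character class rules above can be tried out with a few
calls to preg_match. This is a sketch only; the subject
strings are invented, and the i modifier is assumed to select
caseless matching.

```php
<?php
// [^\W_] matches any letter or digit, but not underscore.
$alnumOk  = preg_match('/^[^\W_]+$/', 'abc123');    // 1
$alnumBad = preg_match('/^[^\W_]+$/', 'abc_123');   // 0

// A negated class still consumes a character, and that
// character may be a newline.
$newline = preg_match('/[^a]/', "\n");              // 1

// [W-]46] is a two-character class followed by the literal "46]".
$literalTail = preg_match('/^[W-]46]$/', '-46]');   // 1

// Escaping "]" makes it the end of the range W..], so the class
// holds that range plus the characters "4" and "6".
$inRange  = preg_match('/^[W-\]46]$/', 'Z');        // 1
$notInSet = preg_match('/^[W-\]46]$/', '5');        // 0

// With caseless matching, letters in a class match both cases.
$caseless = preg_match('/^[aeiou]$/i', 'A');        // 1

var_dump($alnumOk, $alnumBad, $newline, $literalTail, $inRange, $notInSet, $caseless);
```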
|
|
|
|
|
|
|
|
VERTICAL BAR
|
|
Vertical bar characters are used to separate alternative
|
|
patterns. For example, the pattern
|
|
|
|
gilbert|sullivan
|
|
|
|
matches either "gilbert" or "sullivan". Any number of alter-
|
|
natives may appear, and an empty alternative is permitted
|
|
(matching the empty string). The matching process tries
|
|
each alternative in turn, from left to right, and the first
|
|
one that succeeds is used. If the alternatives are within a
|
|
subpattern (defined below), "succeeds" means matching the
|
|
rest of the main pattern as well as the alternative in the
|
|
subpattern.
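
The left-to-right rule for alternatives matters when one
alternative is a prefix of another. A minimal PHP sketch,
with made-up subject strings:

```php
<?php
// Either word is accepted; the leftmost match in the subject wins.
preg_match('/gilbert|sullivan/', 'gilbert and sullivan', $either);

// Alternatives are tried in order, so the first one that
// succeeds is used even if a later one would match more.
preg_match('/a|ab/', 'ab', $shortFirst);   // matches "a"
preg_match('/ab|a/', 'ab', $longFirst);    // matches "ab"

// An empty alternative matches the empty string.
$empty = preg_match('/^(x|)$/', '');       // 1

var_dump($either[0], $shortFirst[0], $longFirst[0], $empty);
```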
|
|
|
|
|
|
|
|
|
|
INTERNAL OPTION SETTING
|
|
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
|
|
and PCRE_EXTENDED can be changed from within the pattern by
|
|
a sequence of Perl option letters enclosed between "(?" and
|
|
")". The option letters are
|
|
|
|
i for PCRE_CASELESS
|
|
m for PCRE_MULTILINE
|
|
s for PCRE_DOTALL
|
|
x for PCRE_EXTENDED
|
|
|
|
For example, (?im) sets caseless, multiline matching. It is
|
|
also possible to unset these options by preceding the letter
|
|
with a hyphen, and a combined setting and unsetting such as
|
|
(?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
|
|
unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
|
|
If a letter appears both before and after the hyphen, the
|
|
option is unset.
|
|
|
|
The scope of these option changes depends on where in the
|
|
pattern the setting occurs. For settings that are outside
|
|
any subpattern (defined below), the effect is the same as if
|
|
the options were set or unset at the start of matching. The
|
|
following patterns all behave in exactly the same way:
|
|
|
|
(?i)abc
|
|
a(?i)bc
|
|
ab(?i)c
|
|
abc(?i)
|
|
|
|
which in turn is the same as compiling the pattern abc with
|
|
PCRE_CASELESS set. In other words, such "top level" set-
|
|
tings apply to the whole pattern (unless there are other
|
|
changes inside subpatterns). If there is more than one set-
|
|
ting of the same option at top level, the rightmost setting
|
|
is used.
|
|
|
|
If an option change occurs inside a subpattern, the effect
|
|
is different. This is a change of behaviour in Perl 5.005.
|
|
An option change inside a subpattern affects only that part
|
|
of the subpattern that follows it, so
|
|
|
|
(a(?i)b)c
|
|
|
|
matches abc and aBc and no other strings (assuming
|
|
PCRE_CASELESS is not used). By this means, options can be
|
|
made to have different settings in different parts of the
|
|
pattern. Any changes made in one alternative do carry on
|
|
into subsequent branches within the same subpattern. For
|
|
example,
|
|
|
|
(a(?i)b|c)
|
|
|
|
matches "ab", "aB", "c", and "C", even though when matching
|
|
"C" the first branch is abandoned before the option setting.
|
|
This is because the effects of option settings happen at
|
|
compile time. There would be some very weird behaviour oth-
|
|
erwise.
|
|
|
|
The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
|
|
be changed in the same way as the Perl-compatible options by
|
|
using the characters U and X respectively. The (?X) flag
|
|
setting is special in that it must always occur earlier in
|
|
the pattern than any of the additional features it turns on,
|
|
even when it is at top level. It is best put at the start.
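
The scoping of an option change inside a subpattern can be
checked with a short PHP sketch. The patterns are the ones
discussed above, with ^ and $ added here so that only whole
strings are accepted; the subject strings are illustrative.

```php
<?php
// (?i) inside the subpattern affects only the "b" that follows it.
$scoped = '/^(a(?i)b)c$/';
$abc  = preg_match($scoped, 'abc');   // 1
$aBc  = preg_match($scoped, 'aBc');   // 1
$abC  = preg_match($scoped, 'abC');   // 0: "c" is outside the subpattern
$ABc  = preg_match($scoped, 'ABc');   // 0: "a" precedes the option change

// The option set in the first branch carries into the second branch.
$branches = '/^(a(?i)b|c)$/';
$upperC = preg_match($branches, 'C');   // 1
$upperA = preg_match($branches, 'Ab');  // 0

var_dump($abc, $aBc, $abC, $ABc, $upperC, $upperA);
```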
|
|
|
|
|
|
|
|
SUBPATTERNS
|
|
Subpatterns are delimited by parentheses (round brackets),
|
|
which can be nested. Marking part of a pattern as a subpat-
|
|
tern does two things:
|
|
|
|
1. It localizes a set of alternatives. For example, the pat-
|
|
tern
|
|
|
|
cat(aract|erpillar|)
|
|
|
|
matches one of the words "cat", "cataract", or "caterpil-
|
|
lar". Without the parentheses, it would match "cataract",
|
|
"erpillar" or the empty string.
|
|
|
|
2. It sets up the subpattern as a capturing subpattern (as
|
|
defined above). When the whole pattern matches, that por-
|
|
tion of the subject string that matched the subpattern is
|
|
passed back to the caller via the <emphasis>ovector</emphasis> argument of
|
|
<function>pcre_exec</function>. Opening parentheses are counted from left to
|
|
right (starting from 1) to obtain the numbers of the captur-
|
|
ing subpatterns.
|
|
|
|
For example, if the string "the red king" is matched against
|
|
the pattern
|
|
|
|
the ((red|white) (king|queen))
|
|
|
|
the captured substrings are "red king", "red", and "king",
|
|
and are numbered 1, 2, and 3.
|
|
|
|
The fact that plain parentheses fulfil two functions is not
|
|
always helpful. There are often times when a grouping sub-
|
|
pattern is required without a capturing requirement. If an
|
|
opening parenthesis is followed by "?:", the subpattern does
|
|
not do any capturing, and is not counted when computing the
|
|
number of any subsequent capturing subpatterns. For example,
|
|
if the string "the white queen" is matched against the
|
|
pattern
|
|
|
|
the ((?:red|white) (king|queen))
|
|
|
|
the captured substrings are "white queen" and "queen", and
|
|
are numbered 1 and 2. The maximum number of captured sub-
|
|
strings is 99, and the maximum number of all subpatterns,
|
|
both capturing and non-capturing, is 200.
|
|
|
|
As a convenient shorthand, if any option settings are
|
|
required at the start of a non-capturing subpattern, the
|
|
option letters may appear between the "?" and the ":". Thus
|
|
the two patterns
|
|
|
|
(?i:saturday|sunday)
|
|
(?:(?i)saturday|sunday)
|
|
|
|
match exactly the same set of strings. Because alternative
|
|
branches are tried from left to right, and options are not
|
|
reset until the end of the subpattern is reached, an option
|
|
setting in one branch does affect subsequent branches, so
|
|
the above patterns match "SUNDAY" as well as "Saturday".
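
In PHP the captured substrings are returned in the array
passed as the third argument of preg_match, with element 0
holding the whole match. A minimal sketch using the patterns
above (the variable names are chosen for this example):

```php
<?php
// Capturing subpatterns are numbered by their opening parenthesis.
preg_match('/the ((red|white) (king|queen))/', 'the red king', $captures);
// $captures[1] = "red king", [2] = "red", [3] = "king"

// (?: ... ) groups without capturing and is not counted.
preg_match('/the ((?:red|white) (king|queen))/', 'the white queen', $nonCapturing);
// $nonCapturing[1] = "white queen", [2] = "queen"

// Option letters may appear between "?" and ":".
$inlineOption = preg_match('/^(?i:saturday|sunday)$/', 'SUNDAY');  // 1

// The empty alternative lets "cat" match on its own.
$cat = preg_match('/^cat(aract|erpillar|)$/', 'cat');              // 1

print_r($captures);
print_r($nonCapturing);
```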
|
|
|
|
|
|
|
|
REPETITION
|
|
Repetition is specified by quantifiers, which can follow any
|
|
of the following items:
|
|
|
|
a single character, possibly escaped
|
|
the . metacharacter
|
|
a character class
|
|
a back reference (see next section)
|
|
a parenthesized subpattern (unless it is an assertion -
|
|
see below)
|
|
|
|
The general repetition quantifier specifies a minimum and
|
|
maximum number of permitted matches, by giving the two
|
|
numbers in curly brackets (braces), separated by a comma.
|
|
The numbers must be less than 65536, and the first must be
|
|
less than or equal to the second. For example:
|
|
|
|
z{2,4}
|
|
|
|
matches "zz", "zzz", or "zzzz". A closing brace on its own
|
|
is not a special character. If the second number is omitted,
|
|
but the comma is present, there is no upper limit; if the
|
|
second number and the comma are both omitted, the quantifier
|
|
specifies an exact number of required matches. Thus
|
|
|
|
[aeiou]{3,}
|
|
|
|
matches at least 3 successive vowels, but may match many
|
|
more, while
|
|
|
|
\d{8}
|
|
|
|
matches exactly 8 digits. An opening curly bracket that
|
|
appears in a position where a quantifier is not allowed, or
|
|
one that does not match the syntax of a quantifier, is taken
|
|
as a literal character. For example, {,6} is not a quantif-
|
|
ier, but a literal string of four characters.
|
|
|
|
The quantifier {0} is permitted, causing the expression to
|
|
behave as if the previous item and the quantifier were not
|
|
present.
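
A few quantifier forms from the text above, sketched in PHP.
The patterns are anchored with ^ and $ here so that the
counts are checked against the whole subject; the subjects
are made up.

```php
<?php
// {2,4}: between two and four repetitions.
$inRange  = preg_match('/^z{2,4}$/', 'zzz');         // 1
$tooMany  = preg_match('/^z{2,4}$/', 'zzzzz');       // 0

// {3,}: at least three, with no upper limit.
$atLeast  = preg_match('/^[aeiou]{3,}$/', 'aeiouae'); // 1
$tooFew   = preg_match('/^[aeiou]{3,}$/', 'ae');      // 0

// {8}: exactly eight.
$exact    = preg_match('/^\d{8}$/', '12345678');     // 1

// {0}: as if the item and its quantifier were not present.
$zero     = preg_match('/^ab{0}c$/', 'ac');          // 1

var_dump($inRange, $tooMany, $atLeast, $tooFew, $exact, $zero);
```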
|
|
|
|
For convenience (and historical compatibility) the three
|
|
most common quantifiers have single-character abbreviations:
|
|
|
|
* is equivalent to {0,}
|
|
+ is equivalent to {1,}
|
|
? is equivalent to {0,1}
|
|
|
|
It is possible to construct infinite loops by following a
|
|
subpattern that can match no characters with a quantifier
|
|
that has no upper limit, for example:
|
|
|
|
(a?)*
|
|
|
|
Earlier versions of Perl and PCRE used to give an error at
|
|
compile time for such patterns. However, because there are
|
|
cases where this can be useful, such patterns are now
|
|
accepted, but if any repetition of the subpattern does in
|
|
fact match no characters, the loop is forcibly broken.
|
|
|
|
By default, the quantifiers are "greedy", that is, they
|
|
match as much as possible (up to the maximum number of per-
|
|
mitted times), without causing the rest of the pattern to
|
|
fail. The classic example of where this gives problems is in
|
|
trying to match comments in C programs. These appear between
|
|
the sequences /* and */ and within the sequence, individual
|
|
* and / characters may appear. An attempt to match C com-
|
|
ments by applying the pattern
|
|
|
|
/\*.*\*/
|
|
|
|
to the string
|
|
|
|
/* first comment */ not comment /* second comment */
|
|
|
|
fails, because it matches the entire string due to the
|
|
greediness of the .* item.
|
|
|
|
However, if a quantifier is followed by a question mark,
|
|
then it ceases to be greedy, and instead matches the minimum
|
|
number of times possible, so the pattern
|
|
|
|
/\*.*?\*/
|
|
|
|
does the right thing with the C comments. The meaning of the
|
|
various quantifiers is not otherwise changed, just the pre-
|
|
ferred number of matches. Do not confuse this use of ques-
|
|
tion mark with its use as a quantifier in its own right.
|
|
Because it has two uses, it can sometimes appear doubled, as
|
|
in
|
|
|
|
\d??\d
|
|
|
|
which matches one digit by preference, but can match two if
|
|
that is the only way the rest of the pattern matches.
|
|
|
|
If the PCRE_UNGREEDY option is set (an option which is not
|
|
available in Perl) then the quantifiers are not greedy by
|
|
default, but individual ones can be made greedy by following
|
|
them with a question mark. In other words, it inverts the
|
|
default behaviour.
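
The difference between greedy and non-greedy matching can be
seen by running the C comment patterns in PHP. This is a
sketch: "#" is used as the pattern delimiter to avoid
escaping the slashes, the U modifier is assumed to stand for
PCRE_UNGREEDY, and the subject string is illustrative.

```php
<?php
$code = '/* first comment */ not comment /* second comment */';

// Greedy: .* runs on to the last "*/" in the subject.
preg_match('#/\*.*\*/#', $code, $greedy);

// Non-greedy: .*? stops at the first "*/".
preg_match('#/\*.*?\*/#', $code, $lazy);

// U inverts the default, so a plain .* becomes non-greedy.
preg_match('#/\*.*\*/#U', $code, $ungreedy);

// A doubled question mark: an optional item that prefers to be absent.
preg_match('/\d??\d/', '12', $digits);

var_dump($greedy[0], $lazy[0], $ungreedy[0], $digits[0]);
```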
|
|
|
|
When a parenthesized subpattern is quantified with a minimum
|
|
repeat count that is greater than 1 or with a limited max-
|
|
imum, more store is required for the compiled pattern, in
|
|
proportion to the size of the minimum or maximum.
|
|
|
|
If a pattern starts with .* or .{0,} and the PCRE_DOTALL
|
|
option (equivalent to Perl's /s) is set, thus allowing the .
|
|
to match newlines, then the pattern is implicitly anchored,
|
|
because whatever follows will be tried against every charac-
|
|
ter position in the subject string, so there is no point in
|
|
retrying the overall match at any position after the first.
|
|
PCRE treats such a pattern as though it were preceded by \A.
|
|
In cases where it is known that the subject string contains
|
|
no newlines, it is worth setting PCRE_DOTALL when the pat-
|
|
tern begins with .* in order to obtain this optimization, or
|
|
alternatively using ^ to indicate anchoring explicitly.
|
|
|
|
When a capturing subpattern is repeated, the value captured
|
|
is the substring that matched the final iteration. For exam-
|
|
ple, after
|
|
|
|
(tweedle[dume]{3}\s*)+
|
|
|
|
has matched "tweedledum tweedledee" the value of the cap-
|
|
tured substring is "tweedledee". However, if there are
|
|
nested capturing subpatterns, the corresponding captured
|
|
values may have been set in previous iterations. For exam-
|
|
ple, after
|
|
/(a|(b))+/
|
|
|
|
matches "aba" the value of the second captured substring is
|
|
"b".
|
|
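
The two capture examples above can be checked in PHP, where
the captured values arrive in the third argument of
preg_match. A minimal sketch:

```php
<?php
// A repeated capturing subpattern keeps the final iteration.
preg_match('/(tweedle[dume]{3}\s*)+/', 'tweedledum tweedledee', $tweedle);
// $tweedle[1] is "tweedledee"

// A nested capture may keep a value set in an earlier iteration.
preg_match('/(a|(b))+/', 'aba', $nested);
// $nested[1] is "a" (last iteration), $nested[2] is "b"

var_dump($tweedle[1], $nested[1], $nested[2]);
```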
|
|
|
|
|
|
BACK REFERENCES
|
|
Outside a character class, a backslash followed by a digit
|
|
greater than 0 (and possibly further digits) is a back
|
|
reference to a capturing subpattern earlier (i.e. to its
|
|
left) in the pattern, provided there have been that many
|
|
previous capturing left parentheses.
|
|
|
|
However, if the decimal number following the backslash is
|
|
less than 10, it is always taken as a back reference, and
|
|
causes an error only if there are not that many capturing
|
|
left parentheses in the entire pattern. In other words, the
|
|
parentheses that are referenced need not be to the left of
|
|
the reference for numbers less than 10. See the section
|
|
entitled "Backslash" above for further details of the han-
|
|
dling of digits following a backslash.
|
|
|
|
A back reference matches whatever actually matched the cap-
|
|
turing subpattern in the current subject string, rather than
|
|
anything matching the subpattern itself. So the pattern
|
|
|
|
(sens|respons)e and \1ibility
|
|
|
|
matches "sense and sensibility" and "response and responsi-
|
|
bility", but not "sense and responsibility". If caseful
|
|
matching is in force at the time of the back reference, then
|
|
the case of letters is relevant. For example,
|
|
|
|
((?i)rah)\s+\1
|
|
|
|
matches "rah rah" and "RAH RAH", but not "RAH rah", even
|
|
though the original capturing subpattern is matched case-
|
|
lessly.
|
|
|
|
There may be more than one back reference to the same sub-
|
|
pattern. If a subpattern has not actually been used in a
|
|
particular match, then any back references to it always
|
|
fail. For example, the pattern
|
|
|
|
(a|(bc))\2
|
|
|
|
always fails if it starts to match "a" rather than "bc".
|
|
Because there may be up to 99 back references, all digits
|
|
following the backslash are taken as part of a potential
|
|
back reference number. If the pattern continues with a digit
|
|
character, then some delimiter must be used to terminate the
|
|
back reference. If the PCRE_EXTENDED option is set, this can
|
|
be whitespace. Otherwise an empty comment can be used.
|
|
|
|
A back reference that occurs inside the parentheses to which
|
|
it refers fails when the subpattern is first used, so, for
|
|
example, (a\1) never matches. However, such references can
|
|
be useful inside repeated subpatterns. For example, the pat-
|
|
tern
|
|
|
|
(a|b\1)+
|
|
|
|
matches any number of "a"s and also "aba", "ababbaa" etc. At
|
|
each iteration of the subpattern, the back reference matches
|
|
the character string corresponding to the previous itera-
|
|
tion. In order for this to work, the pattern must be such
|
|
that the first iteration does not need to match the back
|
|
reference. This can be done using alternation, as in the
|
|
example above, or by a quantifier with a minimum of zero.
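
A PHP sketch of the back reference behaviour described
above. The patterns are written in single-quoted strings so
that \1 reaches PCRE unchanged; ^ and $ are added in places
so that whole subjects are tested. Subject strings are
illustrative.

```php
<?php
// \1 must repeat the text that group 1 actually matched.
$pattern = '/(sens|respons)e and \1ibility/';
$same  = preg_match($pattern, 'sense and sensibility');      // 1
$mixed = preg_match($pattern, 'sense and responsibility');   // 0

// The group is caseless, but the back reference is caseful.
$rah = '/^((?i)rah)\s+\1$/';
$lower   = preg_match($rah, 'rah rah');   // 1
$upper   = preg_match($rah, 'RAH RAH');   // 1
$differs = preg_match($rah, 'RAH rah');   // 0

// A back reference inside its own repeated group refers to the
// previous iteration.
$selfRef = '/^(a|b\1)+$/';
$aba     = preg_match($selfRef, 'aba');       // 1
$longer  = preg_match($selfRef, 'ababbaa');   // 1
$broken  = preg_match($selfRef, 'ababaa');    // 0

var_dump($same, $mixed, $lower, $upper, $differs, $aba, $longer, $broken);
```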
|
|
|
|
|
|
|
|
ASSERTIONS
|
|
An assertion is a test on the characters following or
|
|
preceding the current matching point that does not actually
|
|
consume any characters. The simple assertions coded as \b,
|
|
\B, \A, \Z, \z, ^ and $ are described above. More compli-
|
|
cated assertions are coded as subpatterns. There are two
|
|
kinds: those that look ahead of the current position in the
|
|
subject string, and those that look behind it.
|
|
|
|
An assertion subpattern is matched in the normal way, except
|
|
that it does not cause the current matching position to be
|
|
changed. Lookahead assertions start with (?= for positive
|
|
assertions and (?! for negative assertions. For example,
|
|
|
|
\w+(?=;)
|
|
|
|
matches a word followed by a semicolon, but does not include
|
|
the semicolon in the match, and
|
|
|
|
foo(?!bar)
|
|
|
|
matches any occurrence of "foo" that is not followed by
|
|
"bar". Note that the apparently similar pattern
|
|
|
|
(?!foo)bar
|
|
|
|
does not find an occurrence of "bar" that is preceded by
|
|
something other than "foo"; it finds any occurrence of "bar"
|
|
whatsoever, because the assertion (?!foo) is always true
|
|
when the next three characters are "bar". A lookbehind
|
|
assertion is needed to achieve this effect.
|
|
Lookbehind assertions start with (?<= for positive asser-
|
|
tions and (?<! for negative assertions. For example,
|
|
|
|
(?<!foo)bar
|
|
|
|
does find an occurrence of "bar" that is not preceded by
|
|
"foo". The contents of a lookbehind assertion are restricted
|
|
such that all the strings it matches must have a fixed
|
|
length. However, if there are several alternatives, they do
|
|
not all have to have the same fixed length. Thus
|
|
|
|
(?<=bullock|donkey)
|
|
|
|
is permitted, but
|
|
|
|
(?<!dogs?|cats?)
|
|
|
|
causes an error at compile time. Branches that match dif-
|
|
ferent length strings are permitted only at the top level of
|
|
a lookbehind assertion. This is an extension compared with
|
|
Perl 5.005, which requires all branches to match the same
|
|
length of string. An assertion such as
|
|
|
|
(?<=ab(c|de))
|
|
|
|
is not permitted, because its single top-level branch can
|
|
match two different lengths, but it is acceptable if rewrit-
|
|
ten to use two top-level branches:
|
|
|
|
(?<=abc|abde)
|
|
|
|
The implementation of lookbehind assertions is, for each
|
|
alternative, to temporarily move the current position back
|
|
by the fixed width and then try to match. If there are
|
|
insufficient characters before the current position, the
|
|
match is deemed to fail. Lookbehinds in conjunction with
|
|
once-only subpatterns can be particularly useful for match-
|
|
ing at the ends of strings; an example is given at the end
|
|
of the section on once-only subpatterns.
|
|
|
|
Several assertions (of any sort) may occur in succession.
|
|
For example,
|
|
|
|
(?<=\d{3})(?<!999)foo
|
|
|
|
matches "foo" preceded by three digits that are not "999".
|
|
Furthermore, assertions can be nested in any combination.
|
|
For example,
|
|
|
|
(?<=(?<!foo)bar)baz
|
|
|
|
matches an occurrence of "baz" that is preceded by "bar"
|
|
which in turn is not preceded by "foo".
|
|
|
|
Assertion subpatterns are not capturing subpatterns, and may
|
|
not be repeated, because it makes no sense to assert the
|
|
same thing several times. If an assertion contains capturing
|
|
subpatterns within it, these are always counted for the pur-
|
|
poses of numbering the capturing subpatterns in the whole
|
|
pattern. Substring capturing is carried out for positive
|
|
assertions, but it does not make sense for negative asser-
|
|
tions.
|
|
|
|
Assertions count towards the maximum of 200 parenthesized
|
|
subpatterns.
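
The lookahead and lookbehind examples above can be sketched
in PHP as follows. The subject strings are invented for the
example; PREG_OFFSET_CAPTURE is used only to show where the
match starts.

```php
<?php
// Positive lookahead: the semicolon is required but not consumed.
preg_match_all('/\w+(?=;)/', 'foo; bar; baz', $words);
// $words[0] is ["foo", "bar"]

// Negative lookahead: the first "foo" is followed by "bar", so
// the match is the second one, at offset 7.
preg_match('/foo(?!bar)/', 'foobar foobaz', $m, PREG_OFFSET_CAPTURE);
$offset = $m[0][1];

// (?!foo)bar looks ahead, so it does not exclude a preceding "foo".
$lookaheadTrap = preg_match('/(?!foo)bar/', 'foobar');    // 1
// A negative lookbehind is what excludes it.
$behindFoo = preg_match('/(?<!foo)bar/', 'foobar');       // 0
$behindBaz = preg_match('/(?<!foo)bar/', 'bazbar');       // 1

// Several assertions in succession.
$threeDigits = preg_match('/(?<=\d{3})(?<!999)foo/', '123foo');  // 1
$nines       = preg_match('/(?<=\d{3})(?<!999)foo/', '999foo');  // 0

// Nested assertions.
$nestedOk  = preg_match('/(?<=(?<!foo)bar)baz/', 'xyzbarbaz');   // 1
$nestedBad = preg_match('/(?<=(?<!foo)bar)baz/', 'foobarbaz');   // 0

var_dump($words[0], $offset, $lookaheadTrap, $behindFoo, $behindBaz);
```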
|
|
|
|
|
|
|
|
ONCE-ONLY SUBPATTERNS
|
|
With both maximizing and minimizing repetition, failure of
|
|
what follows normally causes the repeated item to be re-
|
|
evaluated to see if a different number of repeats allows the
|
|
rest of the pattern to match. Sometimes it is useful to
|
|
prevent this, either to change the nature of the match, or
|
|
to cause it to fail earlier than it otherwise might, when the
|
|
author of the pattern knows there is no point in carrying
|
|
on.
|
|
|
|
Consider, for example, the pattern \d+foo when applied to
|
|
the subject line
|
|
|
|
123456bar
|
|
|
|
After matching all 6 digits and then failing to match "foo",
|
|
the normal action of the matcher is to try again with only 5
|
|
digits matching the \d+ item, and then with 4, and so on,
|
|
before ultimately failing. Once-only subpatterns provide the
|
|
means for specifying that once a portion of the pattern has
|
|
matched, it is not to be re-evaluated in this way, so the
|
|
matcher would give up immediately on failing to match "foo"
|
|
the first time. The notation is another kind of special
|
|
parenthesis, starting with (?> as in this example:
|
|
|
|
(?>\d+)foo
|
|
|
|
This kind of parenthesis "locks up" the part of the pattern
|
|
it contains once it has matched, and a failure further into
|
|
the pattern is prevented from backtracking into it. Back-
|
|
tracking past it to previous items, however, works as nor-
|
|
mal.
|
|
|
|
An alternative description is that a subpattern of this type
|
|
matches the string of characters that an identical stan-
|
|
dalone pattern would match, if anchored at the current point
|
|
in the subject string.
|
|
|
|
Once-only subpatterns are not capturing subpatterns. Simple
|
|
cases such as the above example can be thought of as a max-
|
|
imizing repeat that must swallow everything it can. So,
|
|
while both \d+ and \d+? are prepared to adjust the number of
|
|
digits they match in order to make the rest of the pattern
|
|
match, (?>\d+) can only match an entire sequence of digits.
|
|
|
|
This construction can of course contain arbitrarily compli-
|
|
cated subpatterns, and it can be nested.
|
|
|
|
Once-only subpatterns can be used in conjunction with look-
|
|
behind assertions to specify efficient matching at the end
|
|
of the subject string. Consider a simple pattern such as
|
|
|
|
abcd$
|
|
|
|
when applied to a long string which does not match it.
|
|
Because matching proceeds from left to right, PCRE will look
|
|
for each "a" in the subject and then see if what follows
|
|
matches the rest of the pattern. If the pattern is specified
|
|
as
|
|
|
|
^.*abcd$
|
|
|
|
then the initial .* matches the entire string at first, but
|
|
when this fails, it backtracks to match all but the last
|
|
character, then all but the last two characters, and so on.
|
|
Once again the search for "a" covers the entire string, from
|
|
right to left, so we are no better off. However, if the pat-
|
|
tern is written as
|
|
|
|
^(?>.*)(?<=abcd)
|
|
|
|
then there can be no backtracking for the .* item; it can
|
|
match only the entire string. The subsequent lookbehind
|
|
assertion does a single test on the last four characters. If
|
|
it fails, the match fails immediately. For long strings,
|
|
this approach makes a significant difference to the process-
|
|
ing time.
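
The effect of a once-only subpattern is easiest to see with
a pattern that can match only by giving characters back. A
minimal PHP sketch, with made-up subject strings:

```php
<?php
// \d+ first takes all five digits, then gives one back for \d.
$backtracks = preg_match('/\d+\d/', '12345');       // 1
// (?>\d+) never gives anything back, so the final \d cannot match.
$onceOnly   = preg_match('/(?>\d+)\d/', '12345');   // 0

// End-of-string test with a single lookbehind check.
$endsWith = preg_match('/^(?>.*)(?<=abcd)/', 'xxxabcd');  // 1
$elsewhere = preg_match('/^(?>.*)(?<=abcd)/', 'abcdxxx'); // 0

var_dump($backtracks, $onceOnly, $endsWith, $elsewhere);
```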
|
|
|
|
|
|
|
|
CONDITIONAL SUBPATTERNS
|
|
It is possible to cause the matching process to obey a sub-
|
|
pattern conditionally or to choose between two alternative
|
|
subpatterns, depending on the result of an assertion, or
|
|
whether a previous capturing subpattern matched or not. The
|
|
two possible forms of conditional subpattern are
|
|
|
|
(?(condition)yes-pattern)
|
|
(?(condition)yes-pattern|no-pattern)
|
|
|
|
If the condition is satisfied, the yes-pattern is used; oth-
|
|
erwise the no-pattern (if present) is used. If there are
|
|
more than two alternatives in the subpattern, a compile-time
|
|
error occurs.
|
|
|
|
There are two kinds of condition. If the text between the
|
|
parentheses consists of a sequence of digits, then the con-
|
|
dition is satisfied if the capturing subpattern of that
|
|
number has previously matched. Consider the following pat-
|
|
tern, which contains non-significant white space to make it
|
|
more readable (assume the PCRE_EXTENDED option) and to
|
|
divide it into three parts for ease of discussion:
|
|
|
|
( \( )? [^()]+ (?(1) \) )
|
|
|
|
The first part matches an optional opening parenthesis, and
|
|
if that character is present, sets it as the first captured
|
|
substring. The second part matches one or more characters
|
|
that are not parentheses. The third part is a conditional
|
|
subpattern that tests whether the first set of parentheses
|
|
matched or not. If they did, that is, if the subject started
|
|
with an opening parenthesis, the condition is true, and so
|
|
the yes-pattern is executed and a closing parenthesis is
|
|
required. Otherwise, since no-pattern is not present, the
|
|
subpattern matches nothing. In other words, this pattern
|
|
matches a sequence of non-parentheses, optionally enclosed
|
|
in parentheses.
|
|
|
|
If the condition is not a sequence of digits, it must be an
|
|
assertion. This may be a positive or negative lookahead or
|
|
lookbehind assertion. Consider this pattern, again contain-
|
|
ing non-significant white space, and with the two alterna-
|
|
tives on the second line:
|
|
|
|
(?(?=[^a-z]*[a-z])
|
|
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
|
|
|
The condition is a positive lookahead assertion that matches
|
|
an optional sequence of non-letters followed by a letter. In
|
|
other words, it tests for the presence of at least one
|
|
letter in the subject. If a letter is found, the subject is
|
|
matched against the first alternative; otherwise it is
|
|
matched against the second. This pattern matches strings in
|
|
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
|
|
letters and dd are digits.
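
Both kinds of condition can be sketched in PHP. The patterns
are written here without the non-significant white space and
with ^ and $ added so that whole subjects are tested; the
second pattern is written so that it accepts the two forms
dd-aaa-dd and dd-dd-dd. Subject strings are illustrative.

```php
<?php
// Condition on capture group 1: a closing parenthesis is
// required only if an opening one was matched.
$parens = '/^(\()?[^()]+(?(1)\))$/';
$bare     = preg_match($parens, 'abc');     // 1
$enclosed = preg_match($parens, '(abc)');   // 1
$unclosed = preg_match($parens, '(abc');    // 0
$unopened = preg_match($parens, 'abc)');    // 0

// Condition as a lookahead assertion: choose a branch depending
// on whether the subject contains a letter.
$dates = '/^(?(?=[^a-z]*[a-z])\d{2}-[a-z]{3}-\d{2}|\d{2}-\d{2}-\d{2})$/';
$withLetters = preg_match($dates, '12-abc-34');   // 1
$digitsOnly  = preg_match($dates, '12-34-56');    // 1
$tooShort    = preg_match($dates, '12-ab-34');    // 0

var_dump($bare, $enclosed, $unclosed, $unopened, $withLetters, $digitsOnly, $tooShort);
```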
|
|
|
|
|
|
|
|
COMMENTS
|
|
The sequence (?# marks the start of a comment which
|
|
continues up to the next closing parenthesis. Nested
|
|
parentheses are not permitted. The characters that make up a
|
|
comment play no part in the pattern matching at all.
|
|
|
|
If the PCRE_EXTENDED option is set, an unescaped # character
|
|
outside a character class introduces a comment that contin-
|
|
ues up to the next newline character in the pattern.
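
Both comment forms, sketched in PHP. The x modifier is
assumed to stand for PCRE_EXTENDED; the pattern and subject
are made up for the example.

```php
<?php
// (?# ... ) is ignored wherever it appears.
$inline = preg_match('/abc(?#this text is ignored)def/', 'abcdef');  // 1

// With x, white space is ignored and "#" starts a comment that
// runs to the end of the line.
$extended = '/
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
/x';
$phone = preg_match($extended, '555-1234');  // 1

var_dump($inline, $phone);
```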
|
|
|
|
|
|
|
|
PERFORMANCE
|
|
Certain items that may appear in patterns are more efficient
|
|
than others. It is more efficient to use a character class
|
|
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
|
|
In general, the simplest construction that provides the
|
|
required behaviour is usually the most efficient. Jeffrey
|
|
Friedl's book "Mastering Regular Expressions" discusses optimizing
|
|
regular expressions for efficient performance.
|
|
|
|
When a pattern begins with .* and the PCRE_DOTALL option is
|
|
set, the pattern is implicitly anchored by PCRE, since it
|
|
can match only at the start of a subject string. However, if
|
|
PCRE_DOTALL is not set, PCRE cannot make this optimization,
|
|
because the . metacharacter does not then match a newline,
|
|
and if the subject string contains newlines, the pattern may
|
|
match from the character immediately following one of them
|
|
instead of from the very start. For example, the pattern
|
|
|
|
(.*) second
|
|
|
|
matches the subject "first\nand second" (where \n stands for
|
|
a newline character) with the first captured substring being
|
|
"and". In order to do this, PCRE has to retry the match
|
|
starting after every newline in the subject.
|
|
|
|
If you are using such a pattern with subject strings that do
|
|
not contain newlines, the best performance is obtained by
|
|
setting PCRE_DOTALL, or starting the pattern with ^.* to
|
|
indicate explicit anchoring. That saves PCRE from having to
|
|
scan along the subject looking for a newline to restart at.
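
The example above can be reproduced in PHP to see how the
captured substring depends on PCRE_DOTALL (the s modifier is
assumed here). A minimal sketch:

```php
<?php
$subject = "first\nand second";

// Without s, the dot stops at the newline, so the match starts
// after it and the first captured substring is "and".
preg_match('/(.*) second/', $subject, $plain);

// With s, the dot crosses the newline and the match starts at
// the very beginning of the subject.
preg_match('/(.*) second/s', $subject, $dotAll);

var_dump($plain[1], $dotAll[1]);
```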
|
|
</literallayout>
|
|
</refsect1>
|
|
</refentry>
|
|
</reference>
|
|
|
|
<!-- Keep this comment at the end of the file
|
|
Local variables:
|
|
mode: sgml
|
|
sgml-omittag:t
|
|
sgml-shorttag:t
|
|
sgml-minimize-attributes:nil
|
|
sgml-always-quote-attributes:t
|
|
sgml-indent-step:1
|
|
sgml-indent-data:t
|
|
sgml-parent-document:nil
|
|
sgml-default-dtd-file:"../manual.ced"
|
|
sgml-exposed-tags:nil
|
|
sgml-local-catalogs:nil
|
|
sgml-local-ecat-files:nil
|
|
End:
|
|
-->
|