mirror of
https://github.com/sigmasternchen/php-doc-en
synced 2025-03-15 16:38:54 +00:00
Reorganize pattern syntax (bug #45256)
git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@268359 c90b9560-bf6c-de11-be94-00142212c4b1
This commit is contained in:
parent
3c57c12773
commit
917ae9d07f
4 changed files with 166 additions and 145 deletions
|
@ -1,5 +1,5 @@
|
|||
<?xml version="1.0" encoding="iso-8859-1"?>
|
||||
<!-- $Revision: 1.5 $ -->
|
||||
<!-- $Revision: 1.6 $ -->
|
||||
<!-- Purpose: basic.text -->
|
||||
<!-- Membership: bundled -->
|
||||
|
||||
|
@ -42,6 +42,12 @@
|
|||
xlink:href="&url.pcre.man;">&url.pcre.man;</link> for more info.
|
||||
</para>
|
||||
</warning>
|
||||
<para>
|
||||
The PCRE library is a set of functions that implement regular
|
||||
expression pattern matching using the same syntax and semantics
|
||||
as Perl 5, with just a few differences (see below). The current
|
||||
implementation corresponds to Perl 5.005.
|
||||
</para>
|
||||
</preface>
|
||||
|
||||
&reference.pcre.setup;
|
||||
|
|
156
reference/pcre/pattern.differences.xml
Normal file
156
reference/pcre/pattern.differences.xml
Normal file
|
@ -0,0 +1,156 @@
|
|||
<?xml version="1.0" encoding="iso-8859-1"?>
|
||||
<!-- $Revision: 1.1 $ -->
|
||||
<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
|
||||
<article xml:id="reference.pcre.pattern.differences" xmlns="http://docbook.org/ns/docbook">
|
||||
<title>Perl Differences</title>
|
||||
<titleabbrev>Differences From Perl</titleabbrev>
|
||||
<para>
|
||||
The differences described here are with respect to Perl 5.005.
|
||||
<orderedlist>
|
||||
<listitem>
|
||||
<simpara>
|
||||
By default, a whitespace character is any character that
|
||||
the C library function isspace() recognizes, though it is
|
||||
possible to compile PCRE with alternative character type
|
||||
tables. Normally isspace() matches space, formfeed, newline,
|
||||
carriage return, horizontal tab, and vertical tab. Perl 5 no
|
||||
longer includes vertical tab in its set of whitespace characters.
|
||||
The \v escape that was in the Perl documentation for
|
||||
a long time was never in fact recognized. However, the character
|
||||
itself was treated as whitespace at least up to 5.002.
|
||||
In 5.004 and 5.005 it does not match \s.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
PCRE does not allow repeat quantifiers on lookahead
|
||||
assertions. Perl permits them, but they do not mean what you
|
||||
might think. For example, (?!a){3} does not assert that the
|
||||
next three characters are not "a". It just asserts that the
|
||||
next character is not "a" three times.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Capturing subpatterns that occur inside negative
|
||||
lookahead assertions are counted, but their entries in the
|
||||
offsets vector are never set. Perl sets its numerical
|
||||
variables from any such patterns that are matched before the
|
||||
assertion fails to match something (thereby succeeding), but
|
||||
only if the negative lookahead assertion contains just one
|
||||
branch.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Though binary zero characters are supported in the subject string,
|
||||
they are not allowed in a pattern string because it is passed as a
|
||||
normal C string, terminated by zero. The escape sequence "\x00" can
|
||||
be used in the pattern to represent a binary zero.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
The following Perl escape sequences are not supported:
|
||||
\l, \u, \L, \U. In fact these are implemented by
|
||||
Perl's general string-handling and are not part of its
|
||||
pattern matching engine.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
The Perl \G assertion is not supported as it is not
|
||||
relevant to single pattern matches.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Fairly obviously, PCRE does not support the (?{code}) and (??{code})
|
||||
construction. However, there is support for recursive patterns.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
There are at the time of writing some oddities in Perl
|
||||
5.005_02 concerned with the settings of captured strings
|
||||
when part of a pattern is repeated. For example, matching
|
||||
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
|
||||
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
|
||||
unset. However, if the pattern is changed to
|
||||
/^(aa(b(b))?)+$/ then $2 (and $3) get set.
|
||||
In Perl 5.004 $2 is set in both cases, and that is also &true;
|
||||
of PCRE. If in the future Perl changes to a consistent state
|
||||
that is different, PCRE may change to follow.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Another as yet unresolved discrepancy is that in Perl
|
||||
5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
|
||||
"a", whereas in PCRE it does not. However, in both Perl and
|
||||
PCRE /^(a)?a/ matched against "a" leaves $1 unset.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
PCRE provides some extensions to the Perl regular
|
||||
expression facilities:
|
||||
<orderedlist>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Although lookbehind assertions must match fixed length
|
||||
strings, each alternative branch of a lookbehind assertion
|
||||
can match a different length of string. Perl 5.005 requires
|
||||
them all to have the same length.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
If <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
|
||||
is set and <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> is
|
||||
not set, the $ meta-character matches only at the very end of the
|
||||
string.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
If <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> is
|
||||
set, a backslash followed by a letter with no special meaning is
|
||||
faulted.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
If <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> is
|
||||
set, the greediness of the repetition quantifiers is inverted,
|
||||
that is, by default they are not greedy, but if followed by a
|
||||
question mark they are.
|
||||
</simpara>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</article>
|
||||
|
||||
<!-- Keep this comment at the end of the file
|
||||
Local variables:
|
||||
mode: sgml
|
||||
sgml-omittag:t
|
||||
sgml-shorttag:t
|
||||
sgml-minimize-attributes:nil
|
||||
sgml-always-quote-attributes:t
|
||||
sgml-indent-step:1
|
||||
sgml-indent-data:t
|
||||
indent-tabs-mode:nil
|
||||
sgml-parent-document:nil
|
||||
sgml-default-dtd-file:"../../../../manual.ced"
|
||||
sgml-exposed-tags:nil
|
||||
sgml-local-catalogs:nil
|
||||
sgml-local-ecat-files:nil
|
||||
End:
|
||||
vim600: syn=xml fen fdm=syntax fdl=2 si
|
||||
vim: et tw=78 syn=sgml
|
||||
vi: ts=1 sw=1
|
||||
-->
|
|
@ -1,152 +1,10 @@
|
|||
<?xml version="1.0" encoding="iso-8859-1"?>
|
||||
<!-- $Revision: 1.20 $ -->
|
||||
<!-- $Revision: 1.21 $ -->
|
||||
<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
|
||||
<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook">
|
||||
<title>Pattern Syntax</title>
|
||||
<titleabbrev>Describes PCRE regex syntax</titleabbrev>
|
||||
|
||||
<section xml:id="pcre.pattern.syntax.description">
|
||||
<title>Description</title>
|
||||
<simpara>
|
||||
The PCRE library is a set of functions that implement regular
|
||||
expression pattern matching using the same syntax and semantics
|
||||
as Perl 5, with just a few differences (see below). The current
|
||||
implementation corresponds to Perl 5.005.
|
||||
</simpara>
|
||||
</section>
|
||||
|
||||
<section xml:id="pcre.pattern.syntax.differences">
|
||||
<title>Differences From Perl</title>
|
||||
<para>
|
||||
The differences described here are with respect to Perl 5.005.
|
||||
<orderedlist>
|
||||
<listitem>
|
||||
<simpara>
|
||||
By default, a whitespace character is any character that
|
||||
the C library function isspace() recognizes, though it is
|
||||
possible to compile PCRE with alternative character type
|
||||
tables. Normally isspace() matches space, formfeed, newline,
|
||||
carriage return, horizontal tab, and vertical tab. Perl 5 no
|
||||
longer includes vertical tab in its set of whitespace characters.
|
||||
The \v escape that was in the Perl documentation for
|
||||
a long time was never in fact recognized. However, the character
|
||||
itself was treated as whitespace at least up to 5.002.
|
||||
In 5.004 and 5.005 it does not match \s.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
PCRE does not allow repeat quantifiers on lookahead
|
||||
assertions. Perl permits them, but they do not mean what you
|
||||
might think. For example, (?!a){3} does not assert that the
|
||||
next three characters are not "a". It just asserts that the
|
||||
next character is not "a" three times.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Capturing subpatterns that occur inside negative
|
||||
lookahead assertions are counted, but their entries in the
|
||||
offsets vector are never set. Perl sets its numerical
|
||||
variables from any such patterns that are matched before the
|
||||
assertion fails to match something (thereby succeeding), but
|
||||
only if the negative lookahead assertion contains just one
|
||||
branch.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Though binary zero characters are supported in the subject string,
|
||||
they are not allowed in a pattern string because it is passed as a
|
||||
normal C string, terminated by zero. The escape sequence "\x00" can
|
||||
be used in the pattern to represent a binary zero.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
The following Perl escape sequences are not supported:
|
||||
\l, \u, \L, \U. In fact these are implemented by
|
||||
Perl's general string-handling and are not part of its
|
||||
pattern matching engine.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
The Perl \G assertion is not supported as it is not
|
||||
relevant to single pattern matches.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Fairly obviously, PCRE does not support the (?{code}) and (??{code})
|
||||
construction. However, there is support for recursive patterns.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
There are at the time of writing some oddities in Perl
|
||||
5.005_02 concerned with the settings of captured strings
|
||||
when part of a pattern is repeated. For example, matching
|
||||
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
|
||||
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
|
||||
unset. However, if the pattern is changed to
|
||||
/^(aa(b(b))?)+$/ then $2 (and $3) get set.
|
||||
In Perl 5.004 $2 is set in both cases, and that is also &true;
|
||||
of PCRE. If in the future Perl changes to a consistent state
|
||||
that is different, PCRE may change to follow.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Another as yet unresolved discrepancy is that in Perl
|
||||
5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
|
||||
"a", whereas in PCRE it does not. However, in both Perl and
|
||||
PCRE /^(a)?a/ matched against "a" leaves $1 unset.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
PCRE provides some extensions to the Perl regular
|
||||
expression facilities:
|
||||
<orderedlist>
|
||||
<listitem>
|
||||
<simpara>
|
||||
Although lookbehind assertions must match fixed length
|
||||
strings, each alternative branch of a lookbehind assertion
|
||||
can match a different length of string. Perl 5.005 requires
|
||||
them all to have the same length.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
If <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
|
||||
is set and <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> is
|
||||
not set, the $ meta-character matches only at the very end of the
|
||||
string.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
If <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> is
|
||||
set, a backslash followed by a letter with no special meaning is
|
||||
faulted.
|
||||
</simpara>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<simpara>
|
||||
If <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> is
|
||||
set, the greediness of the repetition quantifiers is inverted,
|
||||
that is, by default they are not greedy, but if followed by a
|
||||
question mark they are.
|
||||
</simpara>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="regexp.reference">
|
||||
<title>Regular Expression Details</title>
|
||||
<section xml:id="regexp.introduction">
|
||||
|
|
|
@ -1,10 +1,11 @@
|
|||
<?xml version="1.0" encoding="iso-8859-1"?>
|
||||
<!-- $Revision: 1.2 $ -->
|
||||
<!-- $Revision: 1.3 $ -->
|
||||
|
||||
<part xml:id="pcre.pattern">
|
||||
<title>PCRE Patterns</title>
|
||||
|
||||
&reference.pcre.pattern.modifiers;
|
||||
&reference.pcre.pattern.differences;
|
||||
&reference.pcre.pattern.syntax;
|
||||
|
||||
</part>
|
||||
|
|
Loading…
Reference in a new issue