From 917ae9d07f014cda94116bf74c9b588b1358170c Mon Sep 17 00:00:00 2001
From: Jakub Vrana
Date: Wed, 5 Nov 2008 16:22:25 +0000
Subject: [PATCH] Reorganize pattern syntax (bug #45256)

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@268359 c90b9560-bf6c-de11-be94-00142212c4b1
---
 reference/pcre/book.xml                |   8 +-
 reference/pcre/pattern.differences.xml | 156 +++++++++++++++++++++++++
 reference/pcre/pattern.syntax.xml      | 144 +----------------------
 reference/pcre/pattern.xml             |   3 +-
 4 files changed, 166 insertions(+), 145 deletions(-)
 create mode 100644 reference/pcre/pattern.differences.xml

diff --git a/reference/pcre/book.xml b/reference/pcre/book.xml
index 76fc2cd111..b63fbf5818 100644
--- a/reference/pcre/book.xml
+++ b/reference/pcre/book.xml
@@ -1,5 +1,5 @@
-
+
@@ -42,6 +42,12 @@
 xlink:href="&url.pcre.man;">&url.pcre.man; for more info.
+
+ The PCRE library is a set of functions that implement regular
+ expression pattern matching using the same syntax and semantics
+ as Perl 5, with just a few differences (see below). The current
+ implementation corresponds to Perl 5.005.
+
 &reference.pcre.setup;
diff --git a/reference/pcre/pattern.differences.xml b/reference/pcre/pattern.differences.xml
new file mode 100644
index 0000000000..8aeff73678
--- /dev/null
+++ b/reference/pcre/pattern.differences.xml
@@ -0,0 +1,156 @@
+
+
+
+
+ Perl Differences
+ Differences From Perl
+
+ The differences described here are with respect to Perl 5.005.
+
+
+
+ By default, a whitespace character is any character that
+ the C library function isspace() recognizes, though it is
+ possible to compile PCRE with alternative character type
+ tables. Normally isspace() matches space, formfeed, newline,
+ carriage return, horizontal tab, and vertical tab. Perl 5 no
+ longer includes vertical tab in its set of whitespace characters.
+ The \v escape that was in the Perl documentation for
+ a long time was never in fact recognized. However, the character
+ itself was treated as whitespace at least up to 5.002.
+ In 5.004 and 5.005 it does not match \s.
+
+
+
+
+ PCRE does not allow repeat quantifiers on lookahead
+ assertions. Perl permits them, but they do not mean what you
+ might think. For example, (?!a){3} does not assert that the
+ next three characters are not "a". It just asserts that the
+ next character is not "a" three times.
+
+
+
+
+ Capturing subpatterns that occur inside negative
+ lookahead assertions are counted, but their entries in the
+ offsets vector are never set. Perl sets its numerical
+ variables from any such patterns that are matched before the
+ assertion fails to match something (thereby succeeding), but
+ only if the negative lookahead assertion contains just one
+ branch.
+
+
+
+
+ Though binary zero characters are supported in the subject string,
+ they are not allowed in a pattern string because it is passed as a
+ normal C string, terminated by zero. The escape sequence "\x00" can
+ be used in the pattern to represent a binary zero.
+
+
+
+
+ The following Perl escape sequences are not supported:
+ \l, \u, \L, \U. In fact these are implemented by
+ Perl's general string-handling and are not part of its
+ pattern matching engine.
+
+
+
+
+ The Perl \G assertion is not supported as it is not
+ relevant to single pattern matches.
+
+
+
+
+ Fairly obviously, PCRE does not support the (?{code}) and (??{code})
+ construction. However, there is support for recursive patterns.
+
+
+
+
+ There are at the time of writing some oddities in Perl
+ 5.005_02 concerned with the settings of captured strings
+ when part of a pattern is repeated. For example, matching
+ "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
+ "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
+ unset. However, if the pattern is changed to
+ /^(aa(b(b))?)+$/ then $2 (and $3) get set.
+ In Perl 5.004 $2 is set in both cases, and that is also &true;
+ of PCRE. If in the future Perl changes to a consistent state
+ that is different, PCRE may change to follow.
+
+
+
+
+ Another as yet unresolved discrepancy is that in Perl
+ 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
+ "a", whereas in PCRE it does not. However, in both Perl and
+ PCRE /^(a)?a/ matched against "a" leaves $1 unset.
+
+
+
+
+ PCRE provides some extensions to the Perl regular
+ expression facilities:
+
+
+
+ Although lookbehind assertions must match fixed length
+ strings, each alternative branch of a lookbehind assertion
+ can match a different length of string. Perl 5.005 requires
+ them all to have the same length.
+
+
+
+
+ If PCRE_DOLLAR_ENDONLY
+ is set and PCRE_MULTILINE is
+ not set, the $ meta-character matches only at the very end of the
+ string.
+
+
+
+
+ If PCRE_EXTRA is
+ set, a backslash followed by a letter with no special meaning is
+ faulted.
+
+
+
+
+ If PCRE_UNGREEDY is
+ set, the greediness of the repetition quantifiers is inverted,
+ that is, by default they are not greedy, but if followed by a
+ question mark they are.
+
+
+
+
+
+
+
+
+
+
diff --git a/reference/pcre/pattern.syntax.xml b/reference/pcre/pattern.syntax.xml
index cfacc77bc1..824d203176 100644
--- a/reference/pcre/pattern.syntax.xml
+++ b/reference/pcre/pattern.syntax.xml
@@ -1,152 +1,10 @@
-
+
 Pattern Syntax
 Describes PCRE regex syntax
-
- Description
-
- The PCRE library is a set of functions that implement regular
- expression pattern matching using the same syntax and semantics
- as Perl 5, with just a few differences (see below). The current
- implementation corresponds to Perl 5.005.
-
-
- Differences From Perl
-
- The differences described here are with respect to Perl 5.005.
-
-
-
- By default, a whitespace character is any character that
- the C library function isspace() recognizes, though it is
- possible to compile PCRE with alternative character type
- tables. Normally isspace() matches space, formfeed, newline,
- carriage return, horizontal tab, and vertical tab. Perl 5 no
- longer includes vertical tab in its set of whitespace characters.
- The \v escape that was in the Perl documentation for
- a long time was never in fact recognized. However, the character
- itself was treated as whitespace at least up to 5.002.
- In 5.004 and 5.005 it does not match \s.
-
-
-
-
- PCRE does not allow repeat quantifiers on lookahead
- assertions. Perl permits them, but they do not mean what you
- might think. For example, (?!a){3} does not assert that the
- next three characters are not "a". It just asserts that the
- next character is not "a" three times.
-
-
-
-
- Capturing subpatterns that occur inside negative
- lookahead assertions are counted, but their entries in the
- offsets vector are never set. Perl sets its numerical
- variables from any such patterns that are matched before the
- assertion fails to match something (thereby succeeding), but
- only if the negative lookahead assertion contains just one
- branch.
-
-
-
-
- Though binary zero characters are supported in the subject string,
- they are not allowed in a pattern string because it is passed as a
- normal C string, terminated by zero. The escape sequence "\x00" can
- be used in the pattern to represent a binary zero.
-
-
-
-
- The following Perl escape sequences are not supported:
- \l, \u, \L, \U. In fact these are implemented by
- Perl's general string-handling and are not part of its
- pattern matching engine.
-
-
-
-
- The Perl \G assertion is not supported as it is not
- relevant to single pattern matches.
-
-
-
-
- Fairly obviously, PCRE does not support the (?{code}) and (??{code})
- construction. However, there is support for recursive patterns.
-
-
-
-
- There are at the time of writing some oddities in Perl
- 5.005_02 concerned with the settings of captured strings
- when part of a pattern is repeated. For example, matching
- "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
- "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
- unset. However, if the pattern is changed to
- /^(aa(b(b))?)+$/ then $2 (and $3) get set.
- In Perl 5.004 $2 is set in both cases, and that is also &true;
- of PCRE. If in the future Perl changes to a consistent state
- that is different, PCRE may change to follow.
-
-
-
-
- Another as yet unresolved discrepancy is that in Perl
- 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
- "a", whereas in PCRE it does not. However, in both Perl and
- PCRE /^(a)?a/ matched against "a" leaves $1 unset.
-
-
-
-
- PCRE provides some extensions to the Perl regular
- expression facilities:
-
-
-
- Although lookbehind assertions must match fixed length
- strings, each alternative branch of a lookbehind assertion
- can match a different length of string. Perl 5.005 requires
- them all to have the same length.
-
-
-
-
- If PCRE_DOLLAR_ENDONLY
- is set and PCRE_MULTILINE is
- not set, the $ meta-character matches only at the very end of the
- string.
-
-
-
-
- If PCRE_EXTRA is
- set, a backslash followed by a letter with no special meaning is
- faulted.
-
-
-
-
- If PCRE_UNGREEDY is
- set, the greediness of the repetition quantifiers is inverted,
- that is, by default they are not greedy, but if followed by a
- question mark they are.
-
-
-
-
-
-
-
-
-
Regular Expression Details
diff --git a/reference/pcre/pattern.xml b/reference/pcre/pattern.xml
index 4243b65d03..8d3fe1b282 100644
--- a/reference/pcre/pattern.xml
+++ b/reference/pcre/pattern.xml
@@ -1,10 +1,11 @@
-
+
 PCRE Patterns
 &reference.pcre.pattern.modifiers;
+ &reference.pcre.pattern.differences;
 &reference.pcre.pattern.syntax;