From 2d24777fae845ba001be4d5aba3d6d60787b2ff7 Mon Sep 17 00:00:00 2001 From: Jakub Vrana Date: Fri, 17 Jun 2005 11:40:21 +0000 Subject: [PATCH] PCRE 5.0 git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@188618 c90b9560-bf6c-de11-be94-00142212c4b1 --- reference/pcre/pattern.modifiers.xml | 4 +- reference/pcre/pattern.syntax.xml | 165 +++++++++++++++++++++++++-- 2 files changed, 161 insertions(+), 8 deletions(-) diff --git a/reference/pcre/pattern.modifiers.xml b/reference/pcre/pattern.modifiers.xml index a7c8c6d92a..4ad58a1816 100644 --- a/reference/pcre/pattern.modifiers.xml +++ b/reference/pcre/pattern.modifiers.xml @@ -1,5 +1,5 @@ - + @@ -12,6 +12,7 @@ The current possible PCRE modifiers are listed below. The names in parentheses refer to internal PCRE names for these modifiers. + Spaces and newlines are ignored in modifiers; any other character causes an error.
@@ -179,6 +180,7 @@ is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. + UTF-8 validity of the pattern is checked since PHP 4.3.5. diff --git a/reference/pcre/pattern.syntax.xml b/reference/pcre/pattern.syntax.xml index 9e958831f7..eb3181f050 100644 --- a/reference/pcre/pattern.syntax.xml +++ b/reference/pcre/pattern.syntax.xml @@ -1,5 +1,5 @@ - + @@ -274,7 +274,7 @@ - backslash + Backslash The backslash character has several uses. Firstly, if it is followed by a non-alphanumeric character, it takes away any @@ -358,6 +358,12 @@ After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). + In UTF-8 mode, "\x{...}" is + allowed, where the contents of the braces is a string of hexadecimal + digits. It is interpreted as a UTF-8 character whose code number is the + given hexadecimal number. The original hexadecimal escape sequence, + \xhh, matches a two-byte UTF-8 character if the value + is greater than 127. After "\0" up to two further octal digits are read. @@ -545,7 +551,11 @@ \z - end of subject(independent of multiline mode) + end of subject (independent of multiline mode) + + + \G + first matching position in subject @@ -575,6 +585,14 @@ newline that is the last character of the string as well as at the end of the string, whereas \z matches only at the end. + + The \G assertion is true only when the current + matching position is at the start point of the match, as specified by + the offset argument of + preg_match. It differs from \A + when the value of offset is non-zero. + It is available since PHP 4.3.3. + \Q and \E can be used to ignore @@ -586,6 +604,116 @@ + + Unicode character properties + + Since PHP 4.4.0 and 5.1.0, three + additional escape sequences to match generic character types are available + when UTF-8 mode is selected. 
They are: + + + + \p{xx} + a character with the xx property + + + \P{xx} + a character without the xx property + + + \X + an extended Unicode sequence + + + + The property names represented by xx above are limited to the Unicode + general category properties. Each character has exactly one such + property, specified by a two-letter abbreviation. For compatibility with + Perl, negation can be specified by including a circumflex between the + opening brace and the property name. For example, \p{^Lu} is the same + as \P{Lu}. + + + If only one letter is specified with \p or \P, it includes all the + properties that start with that letter. In this case, in the absence of + negation, the curly brackets in the escape sequence are optional; these + two examples have the same effect: + + + \p{L} + \pL + + + Supported property codes + +
+ C   Other
+ Cc  Control
+ Cf  Format
+ Cn  Unassigned
+ Co  Private use
+ Cs  Surrogate
+ L   Letter
+ Ll  Lower case letter
+ Lm  Modifier letter
+ Lo  Other letter
+ Lt  Title case letter
+ Lu  Upper case letter
+ M   Mark
+ Mc  Spacing mark
+ Me  Enclosing mark
+ Mn  Non-spacing mark
+ N   Number
+ Nd  Decimal number
+ Nl  Letter number
+ No  Other number
+ P   Punctuation
+ Pc  Connector punctuation
+ Pd  Dash punctuation
+ Pe  Close punctuation
+ Pf  Final punctuation
+ Pi  Initial punctuation
+ Po  Other punctuation
+ Ps  Open punctuation
+ S   Symbol
+ Sc  Currency symbol
+ Sk  Modifier symbol
+ Sm  Mathematical symbol
+ So  Other symbol
+ Z   Separator
+ Zl  Line separator
+ Zp  Paragraph separator
+ Zs  Space separator
+ + +
+ + Extended properties such as "Greek" or "InMusicalSymbols" are not + supported by PCRE. + + + Specifying caseless matching does not affect these escape sequences. + For example, \p{Lu} always matches only upper case letters. + + + The \X escape matches any number of Unicode characters that form an + extended Unicode sequence. \X is equivalent to + (?>\PM\pM*). + + + That is, it matches a character without the "mark" property, followed + by zero or more characters with the "mark" property, and treats the + sequence as an atomic group (see below). Characters with the "mark" + property are typically accents that affect the preceding character. + + + Matching characters by Unicode property is not fast, because PCRE has + to search a structure that contains data for over fifteen thousand + characters. That is why the traditional escape sequences such as \d and + \w do not use Unicode properties in PCRE. + +
+ Circumflex and dollar @@ -646,7 +774,7 @@ - FULL STOP + Full stop Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing @@ -658,6 +786,11 @@ both involve newline characters. Dot has no special meaning in a character class. + + \C can be used to match a single byte. This is useful + in UTF-8 mode, where a full stop matches a whole + character, which may consist of multiple bytes. + @@ -862,7 +995,7 @@ - subpatterns + Subpatterns Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern @@ -1119,7 +1252,7 @@ - BACK REFERENCES + Back references Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back @@ -1479,7 +1612,12 @@ in parentheses. - If the condition is not a sequence of digits, it must be an + If the condition is the string (R), it is satisfied if + a recursive call to the pattern or subpattern has been made. At "top + level", the condition is false. + + + If the condition is not a sequence of digits or (R), it must be an assertion. This may be a positive or negative lookahead or lookbehind assertion. Consider this pattern, again containing non-significant white space, and with the two alternatives on @@ -1585,6 +1723,19 @@ for recursive subpatterns too. It is also possible to use named subpatterns: (?P>foo). + + If the syntax for a recursive subpattern reference (either by number or + by name) is used outside the parentheses to which it refers, it operates + like a subroutine in a programming language. An earlier example + pointed out that the pattern + (sens|respons)e and \1ibility + matches "sense and sensibility" and "response and responsibility", but + not "sense and responsibility". If instead the pattern + (sens|respons)e and (?1)ibility + is used, it does match "sense and responsibility" as well as the other + two strings. 
Such references must, however, follow the subpattern to + which they refer. +